We review machine learning methods employing positive definite
kernels. These methods formulate learning and estimation problems
in a reproducing kernel Hilbert space (RKHS) of functions defined
on the data domain, expanded in terms of a kernel. Working in linear
spaces of functions has the benefit of facilitating the construction and
analysis of learning algorithms while at the same time allowing large
classes of functions. The latter include nonlinear functions as well as
functions defined on nonvectorial data.

We cover a wide range of methods, ranging from binary classifiers 
to sophisticated methods for estimation with structured data. 

1. Introduction. Over the last ten years estimation and learning methods
utilizing positive definite kernels have become rather popular, particularly
in machine learning. Since these methods have a stronger mathematical
slant than earlier machine learning methods (e.g., neural networks), there
is also significant interest in them in the statistics and mathematics
community. The present review aims to summarize the state of the art on
a conceptual level. In doing so, we build on various sources, including Burges
[25], Cristianini and Shawe-Taylor [37], Herbrich [64] and Vapnik [141] and,
in particular, Schölkopf and Smola [118], but we also add a fair amount of
more recent material which helps unify the exposition. We have not had
space to include proofs; they can be found either in the long version of the
present paper (see Hofmann et al. [69]), in the references given or in the
above books.

The main idea of all the described methods can be summarized in one
paragraph. Traditionally, theory and algorithms of machine learning and
statistics have been very well developed for the linear case. Real world data
analysis problems, on the other hand, often require nonlinear methods to
detect the kind of dependencies that allow successful prediction of properties
of interest. By using a positive definite kernel, one can sometimes have the
best of both worlds. The kernel corresponds to a dot product in a (usually
high-dimensional) feature space. In this space, our estimation methods are
linear, but as long as we can formulate everything in terms of kernel
evaluations, we never explicitly have to compute in the high-dimensional
feature space.

The paper has three main sections: Section 2 deals with fundamental
properties of kernels, with special emphasis on (conditionally) positive
definite kernels and their characterization. We give concrete examples for
such kernels and discuss kernels and reproducing kernel Hilbert spaces in the
context of regularization. Section 3 presents various approaches for estimating
dependencies and analyzing data that make use of kernels. We provide an
overview of the problem formulations as well as their solution using convex
programming techniques. Finally, Section 4 examines the use of reproducing
kernel Hilbert spaces as a means to define statistical models, the focus
being on structured, multidimensional responses. We also show how such
techniques can be combined with Markov networks as a suitable framework
to model dependencies between response variables.

2. Kernels. 

2.1. An introductory example. Suppose we are given empirical data

(1)  $(x_1, y_1), \dots, (x_n, y_n) \in \mathcal{X} \times \mathcal{Y}$.

Here, the domain $\mathcal{X}$ is some nonempty set that the inputs (the
predictor variables) $x_i$ are taken from; the $y_i \in \mathcal{Y}$ are called
targets (the response variable). Here and below, $i, j \in [n]$, where we use
the notation $[n] := \{1, \dots, n\}$.

Note that we have not made any assumptions on the domain $\mathcal{X}$ other
than it being a set. In order to study the problem of learning, we need
additional structure. In learning, we want to be able to generalize to unseen
data points. In the case of binary pattern recognition, given some new input
$x \in \mathcal{X}$, we want to predict the corresponding $y \in \{\pm 1\}$
(more complex output domains $\mathcal{Y}$ will be treated below). Loosely
speaking, we want to choose $y$ such that $(x, y)$ is in some sense similar to
the training examples. To this end, we need similarity measures in
$\mathcal{X}$ and in $\{\pm 1\}$. The latter is easier, as two target values
can only be identical or different. For the former, we require a function

(2)  $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}, \quad (x, x') \mapsto k(x, x')$

satisfying, for all $x, x' \in \mathcal{X}$,

(3)  $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$,

where $\Phi$ maps into some dot product space $\mathcal{H}$, sometimes called
the feature space. The similarity measure $k$ is usually called a kernel, and
$\Phi$ is called its feature map.

Fig. 1. A simple geometric classification algorithm: given two classes of
points (depicted by "o" and "+"), compute their means $c_+$, $c_-$ and assign a
test input $x$ to the one whose mean is closer. This can be done by looking at
the dot product between $x - c$ [where $c = (c_+ + c_-)/2$] and
$w := c_+ - c_-$, which changes sign as the enclosed angle passes through
$\pi/2$. Note that the corresponding decision boundary is a hyperplane (the
dotted line) orthogonal to $w$ (from Schölkopf and Smola [118]).

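The advantage of using such a kernel as a similarity measure is that
it allows us to construct algorithms in dot product spaces. For instance,
consider the following simple classification algorithm, described in Figure 1,
where $\mathcal{Y} = \{\pm 1\}$. The idea is to compute the means of the two
classes in the feature space,
$c_+ = \frac{1}{n_+} \sum_{\{i: y_i = +1\}} \Phi(x_i)$ and
$c_- = \frac{1}{n_-} \sum_{\{i: y_i = -1\}} \Phi(x_i)$,
where $n_+$ and $n_-$ are the number of examples with positive and negative
target values, respectively. We then assign a new point $\Phi(x)$ to the class
whose mean is closer to it. This leads to the prediction rule

(4)  $y = \operatorname{sgn}\bigl( \langle \Phi(x), c_+ \rangle - \langle \Phi(x), c_- \rangle + b \bigr)$

with $b = \frac{1}{2} (\|c_-\|^2 - \|c_+\|^2)$. Substituting the expressions
for $c_\pm$ yields

(5)  $y = \operatorname{sgn}\Bigl( \frac{1}{n_+} \sum_{\{i: y_i = +1\}} \underbrace{\langle \Phi(x), \Phi(x_i) \rangle}_{k(x, x_i)} - \frac{1}{n_-} \sum_{\{i: y_i = -1\}} \underbrace{\langle \Phi(x), \Phi(x_i) \rangle}_{k(x, x_i)} + b \Bigr)$,

where $b = \frac{1}{2} \Bigl( \frac{1}{n_-^2} \sum_{\{(i,j): y_i = y_j = -1\}} k(x_i, x_j) - \frac{1}{n_+^2} \sum_{\{(i,j): y_i = y_j = +1\}} k(x_i, x_j) \Bigr)$.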
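The prediction rule (5) needs nothing but kernel evaluations, which a short program can make concrete. The following is a minimal sketch, assuming a Gaussian kernel on tuples of floats and a small toy data set; the function names, the data and the convention of returning +1 when the score is exactly zero are illustrative choices, not part of the text above.

```python
import math


def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian kernel on tuples of floats (an assumed choice of k)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mean_classifier(train_x, train_y, k):
    """Return the decision function of rule (5) for labels in {+1, -1}."""
    pos = [x for x, y in zip(train_x, train_y) if y == +1]
    neg = [x for x, y in zip(train_x, train_y) if y == -1]
    n_pos, n_neg = len(pos), len(neg)
    # Squared norms of the class means, expressed through kernel values only.
    norm_pos = sum(k(a, c) for a in pos for c in pos) / n_pos ** 2
    norm_neg = sum(k(a, c) for a in neg for c in neg) / n_neg ** 2
    b = 0.5 * (norm_neg - norm_pos)

    def predict(x):
        score = (sum(k(x, xi) for xi in pos) / n_pos
                 - sum(k(x, xi) for xi in neg) / n_neg
                 + b)
        # Ties (score == 0) are assigned to +1 by convention here.
        return 1 if score >= 0 else -1

    return predict


# Toy data, purely illustrative.
train_x = [(0.0, 0.0), (0.2, 0.1), (3.0, 3.0), (2.8, 3.1)]
train_y = [-1, -1, +1, +1]
predict = mean_classifier(train_x, train_y, gaussian_kernel)
```

Calling `predict` on a new point evaluates the kernel against every training example, so the cost per prediction grows linearly with the training set size.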

Let us consider one well-known special case of this type of classifier.
Assume that the class means have the same distance to the origin (hence,
$b = 0$), and that $k(\cdot, x)$ is a density for all $x \in \mathcal{X}$. If
the two classes are equally likely and were generated from two probability
distributions that are estimated

(6)  $p_+(x) := \frac{1}{n_+} \sum_{\{i: y_i = +1\}} k(x, x_i), \qquad p_-(x) := \frac{1}{n_-} \sum_{\{i: y_i = -1\}} k(x, x_i)$,

then (5) is the estimated Bayes decision rule, plugging in the estimates $p_+$
and $p_-$ for the true densities.

The classifier (5) is closely related to the Support Vector Machine (SVM)
that we will discuss below. It is linear in the feature space (4), while in the
input domain, it is represented by a kernel expansion (5). In both cases, the
decision boundary is a hyperplane in the feature space; however, the normal
vectors [for (4), $w = c_+ - c_-$] are usually rather different.

The normal vector not only characterizes the alignment of the hyperplane;
its length can also be used to construct tests for the equality of the two
class-generating distributions (Borgwardt et al. [22]).

As an aside, note that if we normalize the targets such that
$\hat{y}_i = y_i / |\{j : y_j = y_i\}|$, in which case the $\hat{y}_i$ sum to
zero, then $\|w\|^2 = \langle K, \hat{y} \hat{y}^\top \rangle_F$, where $K$ is
the kernel matrix with entries $k(x_i, x_j)$ and
$\langle \cdot, \cdot \rangle_F$ is the Frobenius dot product. If the two
classes have equal size, then up to a scaling factor involving $\|K\|_2$ and
$n$, this equals the kernel-target alignment defined by Cristianini et al. [38].

2.2. Positive definite kernels. We have required that a kernel satisfy (3),
that is, correspond to a dot product in some dot product space. In the
present section we show that the class of kernels that can be written in the
form (3) coincides with the class of positive definite kernels. This has
far-reaching consequences. There are examples of positive definite kernels which
can be evaluated efficiently even though they correspond to dot products in
infinite dimensional dot product spaces. In such cases, substituting $k(x, x')$
for $\langle \Phi(x), \Phi(x') \rangle$, as we have done in (5), is crucial. In
the machine learning community, this substitution is called the kernel trick.

Definition 1 (Gram matrix). Given a kernel $k$ and inputs
$x_1, \dots, x_n \in \mathcal{X}$, the $n \times n$ matrix

(7)  $K := (k(x_i, x_j))_{ij}$

is called the Gram matrix (or kernel matrix) of $k$ with respect to
$x_1, \dots, x_n$.

Definition 2 (Positive definite matrix). A real $n \times n$ symmetric matrix
$K_{ij}$ satisfying

(8)  $\sum_{i,j} c_i c_j K_{ij} \geq 0$

for all $c_i \in \mathbb{R}$ is called positive definite. If equality in (8)
only occurs for $c_1 = \cdots = c_n = 0$, then we shall call the matrix strictly
positive definite.

Definition 3 (Positive definite kernel). Let $\mathcal{X}$ be a nonempty set. A
function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ which for all
$n \in \mathbb{N}$, $x_i \in \mathcal{X}$, $i \in [n]$ gives rise to a
positive definite Gram matrix is called a positive definite kernel. A function
$k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ which for all
$n \in \mathbb{N}$ and distinct $x_i \in \mathcal{X}$ gives rise to a strictly
positive definite Gram matrix is called a strictly positive definite kernel.
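Definitions 1 to 3 can be probed numerically. The sketch below builds the Gram matrix of a Gaussian kernel on a few real inputs and evaluates the quadratic form in (8) for randomly drawn coefficient vectors. Sampling can only falsify positive definiteness, never prove it, and the kernel, the sample points and the helper names are assumptions made for illustration.

```python
import math
import random


def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian kernel on the real line (an assumed example kernel)."""
    return math.exp(-(x - z) ** 2 / (2.0 * sigma ** 2))


def gram_matrix(k, xs):
    """Gram matrix K_ij = k(x_i, x_j), as in Definition 1."""
    return [[k(xi, xj) for xj in xs] for xi in xs]


def quadratic_form(K, c):
    """The sum over i, j of c_i c_j K_ij that appears in Definition 2."""
    n = len(c)
    return sum(c[i] * c[j] * K[i][j] for i in range(n) for j in range(n))


rng = random.Random(0)  # fixed seed so the check is reproducible
xs = [rng.uniform(-3.0, 3.0) for _ in range(6)]
K = gram_matrix(gaussian_kernel, xs)

# Evaluate the quadratic form for many random coefficient vectors.
forms = [quadratic_form(K, [rng.gauss(0.0, 1.0) for _ in xs])
         for _ in range(200)]
min_form = min(forms)
```

For a positive definite kernel, `min_form` should be nonnegative up to floating-point rounding.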

Occasionally, we shall refer to positive definite kernels simply as kernels. 
Note that, for simplicity, we have restricted ourselves to the case of real 
valued kernels. However, with small changes, the below will also hold for the 
complex valued case. 

Since $\sum_{i,j} c_i c_j \langle \Phi(x_i), \Phi(x_j) \rangle = \bigl\langle \sum_i c_i \Phi(x_i), \sum_j c_j \Phi(x_j) \bigr\rangle \geq 0$,
kernels of the form (3) are positive definite for any choice of $\Phi$. In
particular, if $\mathcal{X}$ is already a dot product space, we may choose
$\Phi$ to be the identity. Kernels can thus be regarded as generalized dot
products. While they are not generally bilinear, they share important
properties with dot products, such as the Cauchy-Schwarz inequality: If $k$ is
a positive definite kernel, and $x_1, x_2 \in \mathcal{X}$, then

(9)  $k(x_1, x_2)^2 \leq k(x_1, x_1) \cdot k(x_2, x_2)$.

2.2.1. Construction of the reproducing kernel Hilbert space. We now define
a map from $\mathcal{X}$ into the space of functions mapping $\mathcal{X}$ into
$\mathbb{R}$, denoted as $\mathbb{R}^{\mathcal{X}}$, via

(10)  $\Phi : \mathcal{X} \to \mathbb{R}^{\mathcal{X}}$ where $x \mapsto k(\cdot, x)$.

Here, $\Phi(x) = k(\cdot, x)$ denotes the function that assigns the value
$k(x', x)$ to $x' \in \mathcal{X}$.

We next construct a dot product space containing the images of the inputs
under $\Phi$. To this end, we first turn it into a vector space by forming
linear combinations

(11)  $f(\cdot) = \sum_{i=1}^{n} \alpha_i k(\cdot, x_i)$.

Here, $n \in \mathbb{N}$, $\alpha_i \in \mathbb{R}$ and $x_i \in \mathcal{X}$
are arbitrary.

Next, we define a dot product between $f$ and another function
$g(\cdot) = \sum_{j=1}^{n'} \beta_j k(\cdot, x'_j)$ (with $n' \in \mathbb{N}$,
$\beta_j \in \mathbb{R}$ and $x'_j \in \mathcal{X}$) as

(12)  $\langle f, g \rangle := \sum_{i=1}^{n} \sum_{j=1}^{n'} \alpha_i \beta_j k(x_i, x'_j)$.

To see that this is well defined although it contains the expansion coefficients
and points, note that $\langle f, g \rangle = \sum_{j=1}^{n'} \beta_j f(x'_j)$.
The latter, however, does not depend on the particular expansion of $f$.
Similarly, for $g$, note that
$\langle f, g \rangle = \sum_{i=1}^{n} \alpha_i g(x_i)$. This also shows that
$\langle \cdot, \cdot \rangle$ is bilinear. It is symmetric, as
$\langle f, g \rangle = \langle g, f \rangle$. Moreover, it is positive
definite, since positive definiteness of $k$ implies that, for any function
$f$, written as (11), we have

(13)  $\langle f, f \rangle = \sum_{i,j=1}^{n} \alpha_i \alpha_j k(x_i, x_j) \geq 0$.

Next, note that given functions $f_1, \dots, f_p$, and coefficients
$\gamma_1, \dots, \gamma_p \in \mathbb{R}$, we have

(14)  $\sum_{i,j=1}^{p} \gamma_i \gamma_j \langle f_i, f_j \rangle = \Bigl\langle \sum_{i=1}^{p} \gamma_i f_i, \sum_{j=1}^{p} \gamma_j f_j \Bigr\rangle \geq 0$.

Here, the equality follows from the bilinearity of
$\langle \cdot, \cdot \rangle$, and the right-hand inequality from (13).

By (14), $\langle \cdot, \cdot \rangle$ is a positive definite kernel, defined
on our vector space of functions. For the last step in proving that it even is
a dot product, we note that, by (12), for all functions (11),

(15)  $\langle k(\cdot, x), f \rangle = f(x)$ and, in particular, $\langle k(\cdot, x), k(\cdot, x') \rangle = k(x, x')$.

By virtue of these properties, $k$ is called a reproducing kernel (Aronszajn
[7]).

Due to (15) and (9), we have

(16)  $|f(x)|^2 = |\langle k(\cdot, x), f \rangle|^2 \leq k(x, x) \cdot \langle f, f \rangle$.

By this inequality, $\langle f, f \rangle = 0$ implies $f = 0$, which is the
last property that was left to prove in order to establish that
$\langle \cdot, \cdot \rangle$ is a dot product.
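The construction in (11) to (16) can be mirrored in a few lines of code. The sketch below represents a function by its expansion coefficients and points, computes the dot product (12) from them, and leaves the reproducing property (15) and the bound (16) to be checked numerically. The Gaussian kernel and the class name `KernelExpansion` are assumptions made for illustration only.

```python
import math


def k(x, z):
    """Gaussian kernel on the real line (an illustrative choice of k)."""
    return math.exp(-0.5 * (x - z) ** 2)


class KernelExpansion:
    """A function f(.) = sum_i alpha_i k(., x_i), as in (11)."""

    def __init__(self, alphas, points):
        self.alphas = list(alphas)
        self.points = list(points)

    def __call__(self, x):
        return sum(a * k(x, xi) for a, xi in zip(self.alphas, self.points))


def dot(f, g):
    """Dot product (12), computed from the expansion coefficients."""
    return sum(a * b * k(xi, xj)
               for a, xi in zip(f.alphas, f.points)
               for b, xj in zip(g.alphas, g.points))


f = KernelExpansion([1.0, -2.0, 0.5], [0.0, 1.0, 2.5])
g = KernelExpansion([0.3, 0.7], [-1.0, 0.4])
x = 0.8
k_x = KernelExpansion([1.0], [x])  # the function k(., x)

reproduced = dot(k_x, f)  # should agree with f(x) by (15)
```

Comparing `reproduced` with `f(x)` illustrates why point evaluation in this space reduces to a dot product with `k(., x)`.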

Skipping some details, we add that one can complete the space of functions
(11) in the norm corresponding to the dot product, and thus obtain a
Hilbert space $\mathcal{H}$, called a reproducing kernel Hilbert space (RKHS).

One can define an RKHS as a Hilbert space $\mathcal{H}$ of functions on a set
$\mathcal{X}$ with the property that, for all $x \in \mathcal{X}$ and
$f \in \mathcal{H}$, the point evaluations $f \mapsto f(x)$ are continuous
linear functionals [in particular, all point values $f(x)$ are well defined,
which already distinguishes RKHSs from many $L_2$ Hilbert spaces]. From the
point evaluation functional, one can then construct the reproducing kernel
using the Riesz representation theorem. The Moore-Aronszajn theorem (Aronszajn
[7]) states that, for every positive definite kernel on
$\mathcal{X} \times \mathcal{X}$, there exists a unique RKHS and vice versa.

There is an analogue of the kernel trick for distances rather than dot
products, that is, dissimilarities rather than similarities. This leads to the
larger class of conditionally positive definite kernels. Those kernels are
defined just like positive definite ones, with the one difference being that
their Gram matrices need to satisfy (8) only subject to

(17)  $\sum_{i=1}^{n} c_i = 0$.

Interestingly, it turns out that many kernel algorithms, including SVMs and
kernel PCA (see Section 3), can be applied also with this larger class of
kernels, due to their being translation invariant in feature space (Hein et al.
[63] and Schölkopf and Smola [118]).

We conclude this section with a note on terminology. In the early years of
kernel machine learning research, it was not the notion of positive definite
kernels that was being used. Instead, researchers considered kernels
satisfying the conditions of Mercer's theorem (Mercer [99]; see, e.g.,
Cristianini and Shawe-Taylor [37] and Vapnik [141]). However, while all such
kernels do satisfy (3), the converse is not true. Since (3) is what we are
interested in, positive definite kernels are thus the right class of kernels
to consider.

2.2.2. Properties of positive definite kernels. We begin with some closure 
properties of the set of positive definite kernels. 

Proposition 4. Below, $k_1, k_2, \dots$ are arbitrary positive definite kernels
on $\mathcal{X} \times \mathcal{X}$, where $\mathcal{X}$ is a nonempty set:

(i) The set of positive definite kernels is a closed convex cone, that is,
(a) if $\alpha_1, \alpha_2 \geq 0$, then $\alpha_1 k_1 + \alpha_2 k_2$ is
positive definite; and (b) if $k(x, x') := \lim_{n \to \infty} k_n(x, x')$
exists for all $x, x'$, then $k$ is positive definite.

(ii) The pointwise product $k_1 k_2$ is positive definite.

(iii) Assume that for $i = 1, 2$, $k_i$ is a positive definite kernel on
$\mathcal{X}_i \times \mathcal{X}_i$, where $\mathcal{X}_i$ is a nonempty set.
Then the tensor product $k_1 \otimes k_2$ and the direct sum $k_1 \oplus k_2$
are positive definite kernels on
$(\mathcal{X}_1 \times \mathcal{X}_2) \times (\mathcal{X}_1 \times \mathcal{X}_2)$.

The proofs can be found in Berg et al. [18]. 
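The closure properties of Proposition 4 translate directly into kernel combinators. The following sketch, under the assumption of two simple base kernels on the real line (linear and Gaussian), builds a conic sum, a pointwise product, a tensor product and a direct sum, and then samples the quadratic form (8) for each. The random check is a sanity test rather than a proof, and all names are hypothetical.

```python
import math
import random


def linear(x, z):
    return x * z


def gaussian(x, z):
    return math.exp(-0.5 * (x - z) ** 2)


def conic_sum(k1, k2, a1=1.0, a2=1.0):
    """a1*k1 + a2*k2 with nonnegative weights, as in Proposition 4(i)(a)."""
    if a1 < 0 or a2 < 0:
        raise ValueError("weights must be nonnegative")
    return lambda x, z: a1 * k1(x, z) + a2 * k2(x, z)


def product(k1, k2):
    """Pointwise product, as in Proposition 4(ii)."""
    return lambda x, z: k1(x, z) * k2(x, z)


def tensor_product(k1, k2):
    """Kernel on pairs: k1 on the first entries times k2 on the second."""
    return lambda x, z: k1(x[0], z[0]) * k2(x[1], z[1])


def direct_sum(k1, k2):
    """Kernel on pairs: k1 on the first entries plus k2 on the second."""
    return lambda x, z: k1(x[0], z[0]) + k2(x[1], z[1])


def min_quadratic_form(k, xs, trials=200, seed=0):
    """Smallest sampled value of sum_ij c_i c_j k(x_i, x_j)."""
    rng = random.Random(seed)
    n = len(xs)
    K = [[k(a, b) for b in xs] for a in xs]
    best = float("inf")
    for _ in range(trials):
        c = [rng.gauss(0.0, 1.0) for _ in range(n)]
        value = sum(c[i] * c[j] * K[i][j] for i in range(n) for j in range(n))
        best = min(best, value)
    return best


points = [-1.5, -0.2, 0.4, 1.1, 2.0]
pairs = [(a, b) for a in points[:3] for b in points[2:]]
results = {
    "sum": min_quadratic_form(conic_sum(linear, gaussian, 0.5, 2.0), points),
    "product": min_quadratic_form(product(linear, gaussian), points),
    "tensor": min_quadratic_form(tensor_product(linear, gaussian), pairs),
    "direct_sum": min_quadratic_form(direct_sum(linear, gaussian), pairs),
}
```

Each entry of `results` should be nonnegative up to rounding error.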

It is reassuring that sums and products of positive definite kernels are
positive definite. We will now explain that, loosely speaking, there are no
other operations that preserve positive definiteness. To this end, let $C$
denote the set of all functions $\psi : \mathbb{R} \to \mathbb{R}$ that map
positive definite kernels to (conditionally) positive definite kernels (readers
who are not interested in the case of conditionally positive definite kernels
may ignore the term in parentheses). We define

$C := \{\psi \mid k$ is a p.d. kernel $\Rightarrow \psi(k)$ is a (conditionally) p.d. kernel$\}$,

$C' = \{\psi \mid$ for any Hilbert space $\mathcal{F}$, $\psi(\langle x, x' \rangle_{\mathcal{F}})$ is (conditionally) positive definite$\}$,

$C'' = \{\psi \mid$ for all $n \in \mathbb{N}$: $K$ is a p.d. $n \times n$ matrix $\Rightarrow \psi(K)$ is (conditionally) p.d.$\}$,

where $\psi(K)$ is the $n \times n$ matrix with elements $\psi(K_{ij})$.






T. HOFMANN, B. SCHOLKOPF AND A. J. SMOLA 



Proposition 5. $C = C' = C''$.

The following proposition follows from a result of FitzGerald et al. [50] for
(conditionally) positive definite matrices; by Proposition 5, it also applies to
(conditionally) positive definite kernels, and to functions of dot products.
We state the latter case.

Proposition 6. Let $\psi : \mathbb{R} \to \mathbb{R}$. Then
$\psi(\langle x, x' \rangle_{\mathcal{F}})$ is positive definite for any
Hilbert space $\mathcal{F}$ if and only if $\psi$ is real entire of the form

(18)  $\psi(t) = \sum_{n=0}^{\infty} a_n t^n$

with $a_n \geq 0$ for $n \geq 0$.

Moreover, $\psi(\langle x, x' \rangle_{\mathcal{F}})$ is conditionally positive
definite for any Hilbert space $\mathcal{F}$ if and only if $\psi$ is real
entire of the form (18) with $a_n \geq 0$ for $n \geq 1$.

There are further properties of $k$ that can be read off the coefficients $a_n$:

• Steinwart [128] showed that if all $a_n$ are strictly positive, then the
  kernel of Proposition 6 is universal on every compact subset $S$ of
  $\mathbb{R}^d$ in the sense that its RKHS is dense in the space of continuous
  functions on $S$ in the $\| \cdot \|_\infty$ norm. For support vector machines
  using universal kernels, he then shows (universal) consistency (Steinwart
  [129]). Examples of universal kernels are (19) and (20) below.

• In Lemma 11 we will show that the $a_0$ term does not affect an SVM.
  Hence, we infer that it is actually sufficient for consistency to have
  $a_n > 0$ for $n \geq 1$.

We conclude the section with an example of a kernel which is positive definite
by Proposition 6. To this end, let $\mathcal{X}$ be a dot product space. The
power series expansion of $\psi(x) = e^x$ then tells us that

(19)  $k(x, x') = e^{\langle x, x' \rangle / \sigma^2}$

is positive definite (Haussler [62]). If we further multiply $k$ with the
positive definite kernel $f(x) f(x')$, where $f(x) = e^{-\|x\|^2 / (2\sigma^2)}$
and $\sigma > 0$, this leads to the positive definiteness of the Gaussian kernel

(20)  $k'(x, x') = k(x, x') f(x) f(x') = e^{-\|x - x'\|^2 / (2\sigma^2)}$.
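The step from (19) to (20) amounts to completing the square in the exponent. As a worked check, using only the definitions of $k$ and $f$ given above:

```latex
\begin{align*}
k(x,x')\,f(x)\,f(x')
  &= \exp\!\Bigl(\frac{\langle x,x'\rangle}{\sigma^2}\Bigr)
     \exp\!\Bigl(-\frac{\|x\|^2}{2\sigma^2}\Bigr)
     \exp\!\Bigl(-\frac{\|x'\|^2}{2\sigma^2}\Bigr) \\
  &= \exp\!\Bigl(-\frac{\|x\|^2 - 2\langle x,x'\rangle + \|x'\|^2}{2\sigma^2}\Bigr)
   = \exp\!\Bigl(-\frac{\|x-x'\|^2}{2\sigma^2}\Bigr).
\end{align*}
```

The factor $f(x) f(x')$ is itself of the form (3), with the one-dimensional feature map $x \mapsto f(x)$, so the product is positive definite by Proposition 4(ii).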
2.2.3. Properties of positive definite functions. We now let
$\mathcal{X} = \mathbb{R}^d$ and consider positive definite kernels of the form

(21)  $k(x, x') = h(x - x')$,

in which case $h$ is called a positive definite function. The following
characterization is due to Bochner [21]. We state it in the form given by
Wendland [152].

Theorem 7. A continuous function $h$ on $\mathbb{R}^d$ is positive definite if
and only if there exists a finite nonnegative Borel measure $\mu$ on
$\mathbb{R}^d$ such that

(22)  $h(x) = \int_{\mathbb{R}^d} e^{-i \langle x, \omega \rangle} \, d\mu(\omega)$.

While normally formulated for complex valued functions, the theorem
also holds true for real functions. Note, however, that if we start with an
arbitrary nonnegative Borel measure, its Fourier transform may not be real.
Real-valued positive definite functions are distinguished by the fact that the
corresponding measures $\mu$ are symmetric.

We may normalize $h$ such that $h(0) = 1$ [hence, by (9), $|h(x)| \leq 1$], in
which case $\mu$ is a probability measure and $h$ is its characteristic
function. For instance, if $\mu$ is a normal distribution of the form
$(2\pi / \sigma^2)^{-d/2} e^{-\sigma^2 \|\omega\|^2 / 2} \, d\omega$,
then the corresponding positive definite function is the Gaussian
$e^{-\|x\|^2 / (2\sigma^2)}$; see (20).

Bochner's theorem allows us to interpret the similarity measure
$k(x, x') = h(x - x')$ in the frequency domain. The choice of the measure $\mu$
determines which frequency components occur in the kernel. Since the solutions
of kernel algorithms will turn out to be finite kernel expansions, the measure
$\mu$ will thus determine which frequencies occur in the estimates, that is, it
will determine their regularization properties (more on that in Section 2.3.2
below).

Bochner's theorem generalizes earlier work of Mathias, and has itself been
generalized in various ways, for example, by Schoenberg [115]. An important
generalization considers Abelian semigroups (Berg et al. [18]). In that case,
the theorem provides an integral representation of positive definite functions
in terms of the semigroup's semicharacters. Further generalizations were
given by Krein, for the cases of positive definite kernels and functions with
a limited number of negative squares. See Stewart [130] for further details
and references.

As above, there are conditions that ensure that the positive definiteness 
becomes strict. 

Proposition 8 (Wendland [152]). A positive definite function is strictly
positive definite if the carrier of the measure in its representation (22)
contains an open subset.

This implies that the Gaussian kernel is strictly positive definite.

An important special case of positive definite functions, which includes
the Gaussian, are radial basis functions. These are functions that can be
written as $h(x) = g(\|x\|^2)$ for some function
$g : [0, \infty[ \to \mathbb{R}$. They have the property of being invariant
under the Euclidean group.

2.2.4. Examples of kernels. We have already seen several instances of 
positive definite kernels, and now intend to complete our selection with a 
few more examples. In particular, we discuss polynomial kernels, convolution 
kernels, ANOVA expansions and kernels on documents. 

Polynomial kernels. From Proposition 4 it is clear that homogeneous
polynomial kernels $k(x, x') = \langle x, x' \rangle^p$ are positive definite
for $p \in \mathbb{N}$ and $x, x' \in \mathbb{R}^d$. By direct calculation, we
can derive the corresponding feature map (Poggio [108]):

(23)  $\langle x, x' \rangle^p = \Bigl( \sum_{j=1}^{d} [x]_j [x']_j \Bigr)^p = \sum_{j \in [d]^p} [x]_{j_1} \cdots [x]_{j_p} \cdot [x']_{j_1} \cdots [x']_{j_p} = \langle C_p(x), C_p(x') \rangle$,

where $C_p$ maps $x \in \mathbb{R}^d$ to the vector $C_p(x)$ whose entries are
all possible $p$th degree ordered products of the entries of $x$ (note that
$[d]$ is used as a shorthand for $\{1, \dots, d\}$). The polynomial kernel of
degree $p$ thus computes a dot product in the space spanned by all monomials of
degree $p$ in the input coordinates. Other useful kernels include the
inhomogeneous polynomial,

(24)  $k(x, x') = (\langle x, x' \rangle + c)^p$ where $p \in \mathbb{N}$ and $c \geq 0$,

which computes all monomials up to degree $p$.
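Identity (23) can be confirmed on a small example by enumerating the ordered monomials explicitly. The sketch below does so for $d = 3$ and $p = 3$; the helper names and the test vectors are illustrative assumptions. The explicit feature vector has $d^p$ entries, which is exactly the cost that the kernel evaluation avoids.

```python
from itertools import product


def ordered_monomials(x, p):
    """C_p(x): all ordered degree-p products of the entries of x."""
    features = []
    for idx in product(range(len(x)), repeat=p):
        value = 1.0
        for j in idx:
            value *= x[j]
        features.append(value)
    return features


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Arbitrary test vectors, chosen only for illustration.
x, z, p = (1.0, 2.0, -1.0), (0.5, -1.0, 3.0), 3

kernel_value = dot(x, z) ** p  # left-hand side of (23)
feature_value = dot(ordered_monomials(x, p), ordered_monomials(z, p))
```

Both quantities agree up to rounding, while `ordered_monomials` already produces 27 features for three input coordinates.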

Spline kernels. It is possible to obtain spline functions as a result of kernel
expansions (Vapnik et al. [144]) simply by noting that convolution of an even
number of indicator functions yields a positive kernel function. Denote by
$I_X$ the indicator (or characteristic) function on the set $X$, and denote by
$\otimes$ the convolution operation,
$(f \otimes g)(x) := \int_{\mathbb{R}^d} f(x') g(x' - x) \, dx'$. Then the
B-spline kernels are given by

(25)  $k(x, x') = B_{2p+1}(x - x')$ where $p \in \mathbb{N}$ with $B_{i+1} := B_i \otimes B_0$.

Here $B_0$ is the characteristic function on the unit ball in $\mathbb{R}^d$.
From the definition of (25), it is obvious that, for odd $m$, we may write
$B_m$ as the inner product between functions $B_{m/2}$. Moreover, note that,
for even $m$, $B_m$ is not a kernel.

Convolutions and structures. Let us now move to kernels defined on structured
objects (Haussler [62] and Watkins [151]). Suppose the object
$x \in \mathcal{X}$ is composed of $x_p \in \mathcal{X}_p$, where $p \in [P]$
(note that the sets $\mathcal{X}_p$ need not be equal). For instance, consider
the string $x = ATG$ and $P = 2$. It is composed of the parts $x_1 = AT$ and
$x_2 = G$, or alternatively, of $x_1 = A$ and $x_2 = TG$. Mathematically
speaking, the set of "allowed" decompositions can be thought of as a relation
$R(x_1, \dots, x_P, x)$, to be read as "$x_1, \dots, x_P$ constitute the
composite object $x$."

Haussler [62] investigated how to define a kernel between composite objects
by building on similarity measures that assess their respective parts;
in other words, kernels $k_p$ defined on $\mathcal{X}_p \times \mathcal{X}_p$.
Define the R-convolution of $k_1, \dots, k_P$ as

(26)  $[k_1 \star \cdots \star k_P](x, x') := \sum_{\bar{x} \in R(x), \bar{x}' \in R(x')} \prod_{p=1}^{P} k_p(\bar{x}_p, \bar{x}'_p)$,

where the sum runs over all possible ways $R(x)$ and $R(x')$ in which we
can decompose $x$ into $\bar{x}_1, \dots, \bar{x}_P$ and $x'$ analogously [here
we used the convention that an empty sum equals zero, hence, if either $x$ or
$x'$ cannot be decomposed, then $(k_1 \star \cdots \star k_P)(x, x') = 0$]. If
there is only a finite number of ways, the relation $R$ is called finite. In
this case, it can be shown that the R-convolution is a valid kernel (Haussler
[62]).

ANOVA kernels. Specific examples of convolution kernels are Gaussians
and ANOVA kernels (Vapnik [141] and Wahba [148]). To construct an ANOVA
kernel, we consider $\mathcal{X} = S^N$ for some set $S$, and kernels $k^{(i)}$
on $S \times S$, where $i = 1, \dots, N$. For $P = 1, \dots, N$, the ANOVA
kernel of order $P$ is defined as

(27)  $k_P(x, x') := \sum_{1 \leq i_1 < \cdots < i_P \leq N} \prod_{p=1}^{P} k^{(i_p)}(x_{i_p}, x'_{i_p})$.

Note that if $P = N$, the sum consists only of the term for which
$(i_1, \dots, i_P) = (1, \dots, N)$, and $k$ equals the tensor product
$k^{(1)} \otimes \cdots \otimes k^{(N)}$. At the other extreme, if $P = 1$,
then the products collapse to one factor each, and $k$ equals the direct sum
$k^{(1)} \oplus \cdots \oplus k^{(N)}$. For intermediate values of $P$, we get
kernels that lie in between tensor products and direct sums.

ANOVA kernels typically use some moderate value of $P$, which specifies
the order of the interactions between attributes $x_{i_p}$ that we are
interested in. The sum then runs over the numerous terms that take into account
interactions of order $P$; fortunately, the computational cost can be reduced
to $O(Pd)$ by utilizing recurrent procedures for the kernel evaluation.
ANOVA kernels have been shown to work rather well in multi-dimensional
SV regression problems (Stitson et al. [131]).
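To illustrate how a recurrence avoids the combinatorial sum in (27), the sketch below compares a brute-force evaluation with the standard recursion for elementary symmetric polynomials, which needs a number of operations proportional to $P$ times the number of coordinates. This particular recursion is an assumption on our part and is not necessarily the procedure the cited works use; the base kernel `base_kernel` is likewise a hypothetical choice shared by all coordinates.

```python
from itertools import combinations


def base_kernel(a, b):
    """Hypothetical base kernel on S = R: constant plus linear term."""
    return 1.0 + a * b


def anova_brute_force(x, z, P):
    """Direct evaluation of (27): sum over all index subsets of size P."""
    total = 0.0
    for idx in combinations(range(len(x)), P):
        term = 1.0
        for i in idx:
            term *= base_kernel(x[i], z[i])
        total += term
    return total


def anova_recursive(x, z, P):
    """Same value via the elementary symmetric polynomial recursion."""
    # e[q] holds the order-q ANOVA kernel over the coordinates seen so far.
    e = [1.0] + [0.0] * P
    for a, b in zip(x, z):
        kv = base_kernel(a, b)
        # Update higher orders first so e[q - 1] is still the old value.
        for q in range(P, 0, -1):
            e[q] += kv * e[q - 1]
    return e[P]


# Arbitrary inputs with N = 4 coordinates, for illustration only.
x = (0.5, -1.0, 2.0, 0.3)
z = (1.5, 0.2, -0.7, 1.0)
values = [(anova_brute_force(x, z, P), anova_recursive(x, z, P))
          for P in range(1, 5)]
```

For $P = 1$ the result is the sum of the base kernel values, and for $P = N$ it is their product, matching the two extremes described above.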
Bag of words. One way in which SVMs have been used for text categorization
(Joachims [77]) is the bag-of-words representation. This maps a given
text to a sparse vector, where each component corresponds to a word, and
a component is set to one (or some other number) whenever the related
word occurs in the text. Using an efficient sparse representation, the dot
product between two such vectors can be computed quickly. Furthermore,
this dot product is by construction a valid kernel, referred to as a sparse
vector kernel. One of its shortcomings, however, is that it does not take into
account the word ordering of a document. Other sparse vector kernels are
also conceivable, such as one that maps a text to the set of pairs of words
that are in the same sentence (Joachims [77] and Watkins [151]).
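A sparse vector kernel of this kind is easy to sketch with a dictionary of word counts. The example below uses word counts as the component values (one of the "some other number" options) and naive whitespace tokenization, both of which are simplifying assumptions; it also shows that two texts with the same words in a different order are indistinguishable to this kernel.

```python
from collections import Counter


def bag_of_words(text):
    """Map a text to sparse word counts (naive whitespace tokenization)."""
    return Counter(text.lower().split())


def sparse_dot(u, v):
    """Dot product of two sparse count vectors, iterating over the smaller."""
    if len(u) > len(v):
        u, v = v, u
    return sum(count * v[word] for word, count in u.items() if word in v)


doc_a = bag_of_words("the cat sat on the mat")
doc_b = bag_of_words("the mat sat on the cat")  # same words, other order
doc_c = bag_of_words("dogs bark")

k_ab = sparse_dot(doc_a, doc_b)
k_aa = sparse_dot(doc_a, doc_a)
k_ac = sparse_dot(doc_a, doc_c)
```

Here `k_ab` equals `k_aa`, since word order is discarded, and `k_ac` is zero because the two texts share no words.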

n-grams and suffix trees. A more sophisticated way of dealing with string
data was proposed by Haussler [62] and Watkins [151]. The basic idea is
as described above for general structured objects (26): Compare the strings
by means of the substrings they contain. The more substrings two strings
have in common, the more similar they are. The substrings need not always
be contiguous; that said, the further apart the first and last element of a
substring are, the less weight should be given to the similarity. Depending
on the specific choice of a similarity measure, it is possible to define more
or less efficient kernels which compute the dot product in the feature space
spanned by all substrings of documents.

Consider a finite alphabet $\Sigma$, the set of all strings of length $n$,
$\Sigma^n$, and the set of all finite strings,
$\Sigma^* := \bigcup_{n=0}^{\infty} \Sigma^n$. The length of a string
$s \in \Sigma^*$ is denoted by $|s|$, and its elements by
$s(1) \dots s(|s|)$; the concatenation of $s$ and $t \in \Sigma^*$ is written
$st$. Denote by

$k(x, x') = \sum_{s} \#(x, s) \#(x', s) c_s$

a string kernel computed from exact matches. Here $\#(x, s)$ is the number of
occurrences of $s$ in $x$ and $c_s \geq 0$.
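For short strings, the exact-match kernel can be evaluated by brute force, which is useful for checking a faster implementation. The sketch below enumerates all contiguous substrings, counting overlapping occurrences separately, and takes the weights $c_s$ from a user-supplied function. It runs in time quadratic in the string length and is not the suffix-tree algorithm; the function names and the default weight $c_s = 1$ are assumptions.

```python
from collections import Counter


def substring_counts(x):
    """Count occurrences of every nonempty contiguous substring of x."""
    counts = Counter()
    for start in range(len(x)):
        for end in range(start + 1, len(x) + 1):
            counts[x[start:end]] += 1
    return counts


def exact_match_kernel(x, z, weight=lambda s: 1.0):
    """Sum over shared substrings s of #(x, s) * #(z, s) * c_s."""
    cx, cz = substring_counts(x), substring_counts(z)
    return sum(n * cz[s] * weight(s) for s, n in cx.items() if s in cz)


# "abab" and "ab" share the substrings a, b and ab.
k_all = exact_match_kernel("abab", "ab")
# Restricting the weights to substrings of length 2 gives a 2-gram kernel.
k_bigram = exact_match_kernel(
    "abab", "ab", weight=lambda s: 1.0 if len(s) == 2 else 0.0)
```

Different choices of `weight` recover different members of the family, for example fixed-length n-gram kernels or kernels that down-weight long substrings.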

Vishwanathan and Smola [146] provide an algorithm using suffix trees,
which allows one to compute for arbitrary $c_s$ the value of the kernel
$k(x, x')$ in $O(|x| + |x'|)$ time and memory. Moreover,
$f(x) = \langle w, \Phi(x) \rangle$ can also be computed in $O(|x|)$ time if
preprocessing linear in the size of the support vectors is carried out. These
kernels are then applied to function prediction (according to the gene
ontology) of proteins using only their sequence information. Another prominent
application of string kernels is in the field of splice form prediction and
gene finding (Rätsch et al. [112]).

For inexact matches of a limited degree, typically up to $\epsilon = 3$, and
strings of bounded length, a similar data structure can be built by explicitly
generating a dictionary of strings and their neighborhood in terms of a Hamming
distance (Leslie et al. [92]). These kernels are defined by replacing
$\#(x, s)$ by a mismatch function $\#(x, s, \epsilon)$ which reports the number
of approximate occurrences of $s$ in $x$. By trading off computational
complexity with storage (hence, the restriction to small numbers of
mismatches), essentially linear-time algorithms can be designed. Whether a
general purpose algorithm exists which allows for efficient comparisons of
strings with mismatches in linear time is still an open question.

Mismatch kernels. In the general case it is only possible to find algorithms
whose complexity is linear in the lengths of the documents being compared,
and the length of the substrings, that is, $O(|x| \cdot |x'|)$ or worse. We now
describe such a kernel with a specific choice of weights (Cristianini and
Shawe-Taylor [37] and Watkins [151]).

Let us now form subsequences $u$ of strings. Given an index sequence
$i := (i_1, \dots, i_{|u|})$ with $1 \leq i_1 < \cdots < i_{|u|} \leq |s|$, we
define $u := s(i) := s(i_1) \dots s(i_{|u|})$. We call
$l(i) := i_{|u|} - i_1 + 1$ the length of the subsequence in $s$. Note that if
$i$ is not contiguous, then $l(i) > |u|$.

The feature space built from strings of length $n$ is defined to be
$\mathcal{H}_n := \mathbb{R}^{(\Sigma^n)}$. This notation means that the space
has one dimension (or coordinate) for each element of $\Sigma^n$, labeled by
that element (equivalently, we can think of it as the space of all real-valued
functions on $\Sigma^n$). We can thus describe the feature map coordinate-wise
for each $u \in \Sigma^n$ via

(28)  $[\Phi_n(s)]_u := \sum_{i: s(i) = u} \lambda^{l(i)}$.

Here, $0 < \lambda \leq 1$ is a decay parameter: The larger the length of the
subsequence in $s$, the smaller the respective contribution to
$[\Phi_n(s)]_u$. The sum runs over all subsequences of $s$ which equal $u$.

For instance, consider a dimension of $\mathcal{H}_3$ spanned (i.e., labeled)
by the string asd. In this case we have
$[\Phi_3(\text{Nasdaq})]_{\text{asd}} = \lambda^3$, while
$[\Phi_3(\text{lass das})]_{\text{asd}} = 2\lambda^5$. In the first string, asd
is a contiguous substring. In the second string, it appears twice as a
noncontiguous substring of length 5 in lass das; the two occurrences use the
characters at positions (2, 3, 6) and (2, 4, 6), respectively.

The kernel induced by the map $\Phi_n$ takes the form

(29)  $k_n(s, t) = \sum_{u \in \Sigma^n} [\Phi_n(s)]_u [\Phi_n(t)]_u = \sum_{u \in \Sigma^n} \sum_{(i,j): s(i) = t(j) = u} \lambda^{l(i)} \lambda^{l(j)}$.

The string kernel $k_n$ can be computed using dynamic programming; see
Watkins [151].
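The feature map (28) and the kernel (29) can be evaluated by brute force on short strings, which reproduces the Nasdaq and lass das values quoted above. The sketch below enumerates index sequences directly and is exponential in $n$; it is not the dynamic programming algorithm of Watkins, and the function names and the decay value $\lambda = 0.5$ are assumptions.

```python
from itertools import combinations


def feature(s, u, lam):
    """[Phi_n(s)]_u from (28): sum of lam**l(i) over all i with s(i) = u."""
    total = 0.0
    for idx in combinations(range(len(s)), len(u)):
        if all(s[i] == c for i, c in zip(idx, u)):
            # l(i) is the span from the first to the last matched index.
            total += lam ** (idx[-1] - idx[0] + 1)
    return total


def all_features(s, n, lam):
    """Nonzero coordinates of Phi_n(s), keyed by the subsequence u."""
    feats = {}
    for idx in combinations(range(len(s)), n):
        u = "".join(s[i] for i in idx)
        feats[u] = feats.get(u, 0.0) + lam ** (idx[-1] - idx[0] + 1)
    return feats


def subsequence_kernel(s, t, n, lam):
    """k_n(s, t) from (29), by brute force (tiny inputs only)."""
    fs, ft = all_features(s, n, lam), all_features(t, n, lam)
    return sum(v * ft[u] for u, v in fs.items() if u in ft)


lam = 0.5  # illustrative decay parameter
nasdaq_asd = feature("Nasdaq", "asd", lam)    # contiguous match
lassdas_asd = feature("lass das", "asd", lam)  # two gapped matches
```

With this value of `lam`, `nasdaq_asd` is `lam**3` and `lassdas_asd` is `2 * lam**5`, in agreement with the example in the text.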

The above string, suffix-tree, mismatch and tree kernels have been used in
sequence analysis. This includes applications in document analysis and
categorization, spam filtering, function prediction in proteins, annotations
of DNA sequences for the detection of introns and exons, named
entity tagging of documents and the construction of parse trees.

Locality improved kernels. It is possible to adjust kernels to the structure
of spatial data. Recall the Gaussian RBF and polynomial kernels. When
applied to an image, it makes no difference whether one uses as $x$ the image
or a version of $x$ where all locations of the pixels have been permuted. This
indicates that the function space on $\mathcal{X}$ induced by $k$ does not take
advantage of the locality properties of the data.

By taking advantage of the local structure, estimates can be improved.
On biological sequences (Zien et al. [157]) one may assign more weight to the
entries of the sequence close to the location where estimates should occur.

For images, local interactions between image patches need to be considered.
One way is to use the pyramidal kernel (DeCoste and Schölkopf [44]
and Schölkopf [116]). It takes inner products between corresponding image
patches, then raises the latter to some power $p_1$, and finally raises their
sum to another power $p_2$. While the overall degree of this kernel is
$p_1 p_2$, the first factor $p_1$ only captures short range interactions.

Tree kernels. We now discuss similarity measures on more structured
objects. For trees Collins and Duffy [31] propose a decomposition method
which maps a tree x into its set of subtrees. The kernel between two trees
x, x' is then computed by taking a weighted sum of all terms between both
trees. In particular, Collins and Duffy [31] show a quadratic time
algorithm, that is, $O(|x| \cdot |x'|)$, to compute this expression, where
$|x|$ is the number of nodes of the tree. When restricting the sum to all
proper rooted subtrees, it is possible to reduce the computational cost to
$O(|x| + |x'|)$ time by means of a tree to string conversion (Vishwanathan
and Smola [146]).

Graph kernels. Graphs pose a twofold challenge: one may design both a
kernel on the vertices of a graph and a kernel between graphs. In the former
case, the graph itself becomes the object defining the metric between the
vertices. See Gartner [56] and Kashima et al. [82] for details on the latter.
In the following we discuss kernels on graphs, that is, kernels between the
vertices of a given graph.

Denote by $W \in \mathbb{R}^{n \times n}$ the adjacency matrix of a graph
with $W_{ij} > 0$ if an edge between i, j exists. Moreover, assume for
simplicity that the graph is undirected, that is, $W^\top = W$. Denote by
$L = D - W$ the graph Laplacian and by
$\tilde{L} = \mathbf{1} - D^{-1/2} W D^{-1/2}$ the normalized graph
Laplacian. Here D is a diagonal matrix with $D_{ii} = \sum_j W_{ij}$
denoting the degree of vertex i.

Fiedler [49] showed that the eigenvector of L associated with the second
smallest eigenvalue approximately decomposes the graph into two parts
according to the signs of its entries. The eigenvectors of the next
smallest eigenvalues partition the graph into correspondingly smaller
portions. L arises from the fact that for a function f defined on the
vertices of the graph
$\sum_{i,j} W_{ij} (f(i) - f(j))^2 = 2 f^\top L f$.

Finally, Smola and Kondor [125] show that, under mild conditions and
up to rescaling, L is the only quadratic permutation invariant form which
can be obtained as a linear function of W.
Hence, it is reasonable to consider kernel matrices K obtained from L
(and $\tilde{L}$). Smola and Kondor [125] suggest kernels $K = r(L)$ or
$K = r(\tilde{L})$, which have desirable smoothness properties. Here
$r : [0, \infty) \to [0, \infty)$ is a monotonically decreasing function.
Popular choices include

(30) $r(\xi) = \exp(-\lambda \xi)$   diffusion kernel,

(31) $r(\xi) = (\xi + \lambda)^{-1}$   regularized graph Laplacian,

(32) $r(\xi) = (\lambda - \xi)^p$   p-step random walk,

where $\lambda > 0$ is chosen such as to reflect the amount of diffusion in
(30), the degree of regularization in (31) or the weighting of steps within
a random walk (32) respectively. Equation (30) was proposed by Kondor and
Lafferty [87]. In Section 2.3.2 we will discuss the connection between
regularization operators and kernels in $\mathbb{R}^n$. Without going into
details, the function $r(\xi)$ describes the smoothness properties on the
graph and L plays the role of the Laplace operator.
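
As a small illustration of how such kernel matrices can be formed, the sketch below applies a function r to the spectrum of the (normalized) graph Laplacian of a toy graph. The helper names `spectral_kernel` and `laplacians`, the four-vertex path graph and the parameter values are assumptions made for this example; the sketch also assumes every vertex has positive degree.

```python
import numpy as np

def spectral_kernel(M, r):
    """K = r(M) for a symmetric matrix M, with r applied to the eigenvalues."""
    evals, evecs = np.linalg.eigh(M)
    return (evecs * r(evals)) @ evecs.T

def laplacians(W):
    d = W.sum(axis=1)                       # vertex degrees (assumed > 0)
    L = np.diag(d) - W                      # graph Laplacian L = D - W
    s = 1.0 / np.sqrt(d)
    L_norm = np.eye(len(W)) - s[:, None] * W * s[None, :]
    return L, L_norm

# Path graph on four vertices (hypothetical toy example).
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L, L_norm = laplacians(W)
lam = 0.5
K_diffusion = spectral_kernel(L, lambda xi: np.exp(-lam * xi))        # diffusion kernel
K_regularized = spectral_kernel(L, lambda xi: 1.0 / (xi + lam))       # regularized Laplacian
K_random_walk = spectral_kernel(L_norm, lambda xi: (2.0 - xi) ** 2)   # 2-step random walk
```

For the random walk kernel the sketch uses the constant 2 because the eigenvalues of the normalized Laplacian lie in [0, 2], which keeps r nonnegative and decreasing on the spectrum.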

Kernels on sets and subspaces. Whenever each observation $x_i$ consists of
a set of instances, we may use a range of methods to capture the specific
properties of these sets (for an overview, see Vishwanathan et al. [147]):

• Take the average of the elements of the set in feature space, that is,
$\phi(x_i) = \frac{1}{n} \sum_j \phi(x_{ij})$. This yields good performance
in the area of multi-instance learning.

• Jebara and Kondor [75] extend the idea by dealing with distributions
$p_i(x)$ such that $\phi(x_i) = E[\phi(x)]$, where $x \sim p_i(x)$. They
apply it to image classification with missing pixels.

• Alternatively, one can study angles enclosed by subspaces spanned by
the observations. In a nutshell, if U, U' denote the orthogonal matrices
spanning the subspaces of x and x' respectively, then
$k(x, x') = \det U^\top U'$.
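
The first option in the list above, averaging in feature space, amounts to averaging a base kernel over all pairs of instances. The following is a minimal sketch under that reading; the Gaussian base kernel and the names `rbf` and `mean_map_kernel` are choices made for illustration.

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    """Gaussian base kernel between the rows of A and the rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def mean_map_kernel(A, B, base=rbf):
    """<mean_i phi(a_i), mean_j phi(b_j)> = average of base(a_i, b_j) over all pairs."""
    return float(np.mean(base(A, B)))
```

With a linear base kernel the value reduces to the inner product of the two set means, which gives a quick sanity check.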

Fisher kernels. Jaakkola and Haussler [74] have designed kernels building
on probability density models $p(x|\theta)$. Denote by

(33) $U_\theta(x) := -\partial_\theta \log p(x|\theta)$,

(34) $I := E_x[U_\theta(x) U_\theta^\top(x)]$,

the Fisher scores and the Fisher information matrix respectively. Note that
for maximum likelihood estimators $E_x[U_\theta(x)] = 0$ and, therefore, I
is the covariance of $U_\theta(x)$. The Fisher kernel is defined as

(35) $k(x, x') := U_\theta^\top(x) I^{-1} U_\theta(x')$ or $k(x, x') := U_\theta^\top(x) U_\theta(x')$

depending on whether we study the normalized or the unnormalized kernel
respectively.
In addition to that, it has several attractive theoretical properties: Oliver
et al. [104] show that estimation using the normalized Fisher kernel
corresponds to estimation subject to a regularization on the
$L_2(p(\cdot|\theta))$ norm.
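
For a concrete instance, consider a Gaussian model with unknown mean $\theta$ and known covariance $\Sigma$. This model is an assumption chosen for illustration, as are the function names below. Here the score is $U_\theta(x) = -\Sigma^{-1}(x - \theta)$ and the Fisher information is $I = \Sigma^{-1}$, so the normalized Fisher kernel becomes the Mahalanobis inner product $(x - \theta)^\top \Sigma^{-1} (x' - \theta)$.

```python
import numpy as np

def fisher_score(x, theta, Sigma_inv):
    """U_theta(x) = -d/dtheta log N(x; theta, Sigma)."""
    return -Sigma_inv @ (x - theta)

def fisher_kernel(x, z, theta, Sigma, normalized=True):
    Sigma_inv = np.linalg.inv(Sigma)
    u = fisher_score(x, theta, Sigma_inv)
    v = fisher_score(z, theta, Sigma_inv)
    if normalized:
        info = Sigma_inv                     # Fisher information of this model
        return float(u @ np.linalg.solve(info, v))
    return float(u @ v)
```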

Moreover, in the context of exponential families (see Section 4.1 for a
more detailed discussion) where
$p(x|\theta) = \exp(\langle \phi(x), \theta \rangle - g(\theta))$, we have

(36) $k(x, x') = [\phi(x) - \partial_\theta g(\theta)]^\top [\phi(x') - \partial_\theta g(\theta)]$

for the unnormalized Fisher kernel. This means that up to centering by
$\partial_\theta g(\theta)$ the Fisher kernel is identical to the kernel
arising from the inner product of the sufficient statistics $\phi(x)$. This
is not a coincidence. In fact, in our analysis of nonparametric exponential
families we will encounter this fact several times (cf. Section 4 for
further details). Moreover, note that the centering is immaterial, as can be
seen in Lemma 11.
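
The step from the definition of the Fisher score to the unnormalized kernel in the exponential family case can be written out as follows (a short derivation sketch using only the definitions above):

```latex
\begin{align*}
\log p(x|\theta) &= \langle \phi(x), \theta \rangle - g(\theta), \\
U_\theta(x) &= -\partial_\theta \log p(x|\theta)
             = -\bigl[\phi(x) - \partial_\theta g(\theta)\bigr], \\
U_\theta^\top(x)\, U_\theta(x') &=
  \bigl[\phi(x) - \partial_\theta g(\theta)\bigr]^\top
  \bigl[\phi(x') - \partial_\theta g(\theta)\bigr].
\end{align*}
```

Since $\partial_\theta g(\theta) = E_x[\phi(x)]$ for exponential families, the centering term is the mean of the sufficient statistics under the model.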

The above overview of kernel design is by no means complete. The reader
is referred to books of Bakir et al. [9], Cristianini and Shawe-Taylor [37],
Herbrich [64], Joachims [77], Scholkopf and Smola [118], Scholkopf [121] and
Shawe-Taylor and Cristianini [123] for further examples and details.

2.3. Kernel function classes. 

2.3.1. The representer theorem. From kernels, we now move to functions 
that can be expressed in terms of kernel expansions. The representer theo- 
rem (Kimeldorf and Wahba [85] and Scholkopf and Smola [118]) shows that 
solutions of a large class of optimization problems can be expressed as kernel 
expansions over the sample points. As above, $\mathcal{H}$ is the RKHS
associated to the kernel k.

Theorem 9 (Representer theorem). Denote by
$\Omega : [0, \infty) \to \mathbb{R}$ a strictly monotonic increasing
function, by $\mathcal{X}$ a set, and by
$c : (\mathcal{X} \times \mathbb{R}^2)^n \to \mathbb{R} \cup \{\infty\}$
an arbitrary loss function. Then each minimizer $f \in \mathcal{H}$ of the
regularized risk functional

(37) $c((x_1, y_1, f(x_1)), \ldots, (x_n, y_n, f(x_n))) + \Omega(\|f\|_{\mathcal{H}}^2)$

admits a representation of the form

(38) $f(x) = \sum_{i=1}^n \alpha_i k(x_i, x)$.

Monotonicity of $\Omega$ does not prevent the regularized risk functional
(37) from having multiple local minima. To ensure a global minimum, we would
need to require convexity. If we discard the strictness of the monotonicity,
then it no longer follows that each minimizer of the regularized risk admits
an expansion (38); it still follows, however, that there is always another
solution that is as good, and that does admit the expansion.

The significance of the representer theorem is that although we might be
trying to solve an optimization problem in an infinite-dimensional space
$\mathcal{H}$, containing linear combinations of kernels centered on
arbitrary points of $\mathcal{X}$, it states that the solution lies in the
span of n particular kernels, namely those centered on the training points.
We will encounter (38) again further below, where it is called the Support
Vector expansion. For suitable choices of loss functions, many of the
$\alpha_i$ often equal 0.

Despite the finiteness of the representation in (38), the number of terms
in the expansion is often too large in practice. This is problematic since
the time required to evaluate (38) is proportional to the number of terms.
One can reduce this number by computing a reduced representation which
approximates the original one in the RKHS norm (e.g., Scholkopf and Smola
[118]).
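
A familiar special case may help make the representer theorem concrete. Assume the squared loss $\sum_i (y_i - f(x_i))^2$ and $\Omega(t) = \lambda t$ (kernel ridge regression, chosen here for illustration). Substituting the expansion (38) gives the coefficients $\alpha = (K + \lambda \mathbf{1})^{-1} y$ with $K_{ij} = k(x_i, x_j)$. The function names below are hypothetical.

```python
import numpy as np

def fit_kernel_ridge(K, y, lam):
    """Coefficients alpha of f(x) = sum_i alpha_i k(x_i, x) for squared loss."""
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def predict(alpha, K_new_train):
    """Evaluate the expansion (38); K_new_train[j, i] = k(x_new_j, x_i)."""
    return K_new_train @ alpha
```

For the linear kernel $k(x, x') = \langle x, x' \rangle$ the expansion reproduces ordinary ridge regression, since $w = \sum_i \alpha_i x_i$ then lies in the span of the training points.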

2.3.2. Regularization properties. The regularizer $\|f\|_{\mathcal{H}}^2$
used in Theorem 9, which is what distinguishes SVMs from many other
regularized function estimators (e.g., based on coefficient based $\ell_1$
regularizers, such as the Lasso (Tibshirani [135]) or linear programming
machines (Scholkopf and Smola [118])), stems from the dot product
$\langle f, f \rangle_k$ in the RKHS $\mathcal{H}$ associated with a
positive definite kernel. The nature and implications of this regularizer,
however, are not obvious and we shall now provide an analysis in the Fourier
domain. It turns out that if the kernel is translation invariant, then its
Fourier transform allows us to characterize how the different frequency
components of f contribute to the value of $\|f\|_{\mathcal{H}}^2$. Our
exposition will be informal (see also Poggio and Girosi [109] and Smola et
al. [127]), and we will implicitly assume that all integrals are over
$\mathbb{R}^d$ and exist, and that the operators are well defined.

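We will rewrite the RKHS dot product as

(39) $\langle f, g \rangle_k = \langle Tf, Tg \rangle = \langle T^2 f, g \rangle$,

where T is a positive (and thus symmetric) operator mapping $\mathcal{H}$
into a function space endowed with the usual dot product

(40) $\langle f, g \rangle = \int f(x) \overline{g(x)}\,dx$.

Rather than (39), we consider the equivalent condition (cf. Section 2.2.1)

(41) $\langle k(x, \cdot), k(x', \cdot) \rangle_k = \langle Tk(x, \cdot), Tk(x', \cdot) \rangle = \langle T^2 k(x, \cdot), k(x', \cdot) \rangle$.

If $k(x, \cdot)$ is a Green function of $T^2$, we have

(42) $\langle T^2 k(x, \cdot), k(x', \cdot) \rangle = \langle \delta_x, k(x', \cdot) \rangle = k(x, x')$,

which by the reproducing property (15) amounts to the desired equality (41).
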

For conditionally positive definite kernels, a similar correspondence can 
be established, with a regularization operator whose null space is spanned 
by a set of functions which are not regularized [in the case (17), which is
sometimes called conditionally positive definite of order 1, these are the
constants].

We now consider the particular case where the kernel can be written
$k(x, x') = h(x - x')$ with a continuous strictly positive definite function
$h \in L_1(\mathbb{R}^d)$ (cf. Section 2.2.3). A variation of Bochner's
theorem, stated by Wendland [152], then tells us that the measure
corresponding to h has a nonvanishing density v with respect to the Lebesgue
measure, that is, that k can be written as

(43) $k(x, x') = \int e^{-i\langle x - x', \omega \rangle} v(\omega)\,d\omega = \int e^{-i\langle x, \omega \rangle}\, \overline{e^{-i\langle x', \omega \rangle}}\, v(\omega)\,d\omega$.



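We would like to rewrite this as
$\langle Tk(x, \cdot), Tk(x', \cdot) \rangle$ for some linear operator T. It
turns out that a multiplication operator in the Fourier domain will do the
job. To this end, recall the d-dimensional Fourier transform, given by

(44) $F[f](\omega) := (2\pi)^{-d/2} \int f(x) e^{-i\langle x, \omega \rangle}\,dx$,

with the inverse
$F^{-1}[f](x) = (2\pi)^{-d/2} \int f(\omega) e^{i\langle x, \omega \rangle}\,d\omega$.

Next, compute the Fourier transform of k as

(45) $F[k(x, \cdot)](\omega) = (2\pi)^{-d/2} \int\!\!\int \bigl(v(\omega') e^{-i\langle x, \omega' \rangle}\bigr) e^{i\langle x', \omega' \rangle}\,d\omega'\, e^{-i\langle x', \omega \rangle}\,dx'$

(46) $\qquad\qquad\quad\; = (2\pi)^{d/2} v(\omega) e^{-i\langle x, \omega \rangle}$.

Hence, we can rewrite (43) as

(47) $k(x, x') = (2\pi)^{-d} \int \frac{F[k(x, \cdot)](\omega)\, \overline{F[k(x', \cdot)](\omega)}}{v(\omega)}\,d\omega$.

If our regularization operator maps

(48) $T : f \mapsto (2\pi)^{-d/2} v^{-1/2} F[f]$,

we thus have

(49) $k(x, x') = \int (Tk(x, \cdot))(\omega)\, \overline{(Tk(x', \cdot))(\omega)}\,d\omega$,

that is, our desired identity (41) holds true.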

As required in (39), we can thus interpret the dot product
$\langle f, g \rangle_k$ in the RKHS as a dot product
$\int (Tf)(\omega) \overline{(Tg)(\omega)}\,d\omega$. This allows us to
understand regularization properties of k in terms of its (scaled) Fourier
transform $v(\omega)$. Small values of $v(\omega)$ amplify the corresponding
frequencies in (48). Penalizing $\langle f, f \rangle_k$ thus amounts to a
strong attenuation of the corresponding frequencies. Hence, small values of
$v(\omega)$ for large $\|\omega\|$ are desirable, since high-frequency
components of $F[f]$ correspond to rapid changes in f. It follows that
$v(\omega)$ describes the filter properties of the corresponding
regularization operator T. In view of our comments following Theorem 7, we
can translate this insight into probabilistic terms: if the probability
measure $v(\omega)\,d\omega / \int v(\omega)\,d\omega$ describes the desired
filter properties, then the natural translation invariant kernel to use is
the characteristic function of the measure.
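
As a worked example, take the Gaussian RBF kernel with width $\sigma$ (the choice of kernel is ours, for illustration). Its spectral density v is the density of a normal distribution with covariance $\sigma^{-2}\mathbf{1}$, and with the multiplication operator $T : f \mapsto (2\pi)^{-d/2} v^{-1/2} F[f]$ the regularizer becomes an exponentially weighted penalty on high frequencies:

```latex
\begin{align*}
k(x, x') &= \exp\!\Bigl(-\frac{\|x - x'\|^2}{2\sigma^2}\Bigr), \qquad
v(\omega) = \Bigl(\frac{\sigma^2}{2\pi}\Bigr)^{d/2}
            \exp\!\Bigl(-\frac{\sigma^2 \|\omega\|^2}{2}\Bigr), \\
\langle f, f \rangle_k &= (2\pi)^{-d} \int \frac{|F[f](\omega)|^2}{v(\omega)}\,d\omega
  = (2\pi)^{-d/2} \sigma^{-d} \int |F[f](\omega)|^2
    \exp\!\Bigl(\frac{\sigma^2 \|\omega\|^2}{2}\Bigr)\,d\omega .
\end{align*}
```

A larger $\sigma$ therefore penalizes rapid changes in f more strongly, in line with the filter interpretation above.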

2.3.3. Remarks and notes. The notion of kernels as dot products in 
Hilbert spaces was brought to the field of machine learning by Aizerman 
et al. [1], Boser et al. [23], Scholkopf et al. [119] and Vapnik [141]. Aizerman
et al. [1] used kernels as a tool in a convergence proof, allowing them to ap- 
ply the Perceptron convergence theorem to their class of potential function 
algorithms. To the best of our knowledge, Boser et al. [23] were the first to 
use kernels to construct a nonlinear estimation algorithm, the hard margin 
predecessor of the Support Vector Machine, from its linear counterpart, the 
generalized portrait (Vapnik [139] and Vapnik and Lerner [145]). While all 
these uses were limited to kernels defined on vectorial data, Scholkopf [116] 
observed that this restriction is unnecessary, and nontrivial kernels on other 
data types were proposed by Haussler [62] and Watkins [151]. Scholkopf et al. 
[119] applied the kernel trick to generalize principal component analysis and 
pointed out the (in retrospect obvious) fact that any algorithm which only 
uses the data via dot products can be generalized using kernels. 

In addition to the above uses of positive definite kernels in machine learn- 
ing, there has been a parallel, and partly earlier development in the field of 
statistics, where such kernels have been used, for instance, for time series 
analysis (Parzen [106]), as well as regression estimation and the solution of 
inverse problems (Wahba [148]). 

In probability theory, positive definite kernels have also been studied in 
depth since they arise as covariance kernels of stochastic processes; see, 
for example, Loeve [93]. This connection is heavily used in a subset
of the machine learning community interested in prediction with Gaussian
processes (Rasmussen and Williams [111]).

In functional analysis, the problem of Hilbert space representations of
kernels has been studied in great detail; a good reference is Berg et al. [18];
indeed, a large part of the material in the present section is based on that
work. Interestingly, it seems that for a fairly long time, there have been
two separate strands of development (Stewart [130]). One of them was the
study of positive definite functions, which started later but seems to have
been unaware of the fact that it considered a special case of positive definite
kernels. The latter was initiated by Hilbert [67] and Mercer [99], and was
pursued, for instance, by Schoenberg [115]. Hilbert calls a kernel k definit if

(50) $\int\!\!\int k(x, x') f(x) f(x')\,dx\,dx' > 0$

for all nonzero continuous functions f, and shows that all eigenvalues of the
corresponding integral operator $f \mapsto \int k(x, \cdot) f(x)\,dx$ are
then positive. If k satisfies the condition (50) subject to the constraint
that $\int f(x) g(x)\,dx = 0$, for some fixed function g, Hilbert calls it
relativ definit. For that case, he shows that k has at most one negative
eigenvalue. Note that if g is chosen to be constant, then this notion is
closely related to the one of conditionally positive definite kernels; see
(17). For further historical details, see the review of Stewart [130] or
Berg et al. [18].

3. Convex programming methods for estimation. As we saw, kernels 
can be used both for the purpose of describing nonlinear functions subject 
to smoothness constraints and for the purpose of computing inner products 
in some feature space efficiently. In this section we focus on the latter and 
how it allows us to design methods of estimation based on the geometry of 
the problems at hand. 

Unless stated otherwise, $E[\cdot]$ denotes the expectation with respect to
all random variables of the argument. Subscripts, such as $E_X[\cdot]$,
indicate that the expectation is taken over X. We will omit them wherever
obvious. Finally, we will refer to $E_{\mathrm{emp}}[\cdot]$ as the
empirical average with respect to an n-sample. Given a sample
$S := \{(x_1, y_1), \ldots, (x_n, y_n)\} \subseteq \mathcal{X} \times \mathcal{Y}$,
we now aim at finding an affine function
$f(x) = \langle w, \phi(x) \rangle + b$ or in some cases a function
$f(x, y) = \langle \phi(x, y), w \rangle$ such that the empirical risk on S
is minimized. In the binary classification case this means that we want to
maximize the agreement between $\operatorname{sgn} f(x)$ and y.

• Minimization of the empirical risk with respect to (w, b) is NP-hard (Min- 
sky and Papert [101]). In fact, Ben-David et al. [15] show that even ap- 
proximately minimizing the empirical risk is NP-hard, not only for linear 
function classes but also for spheres and other simple geometrical objects. 
This means that even if the statistical challenges could be solved, we still 
would be confronted with a formidable algorithmic problem. 

• The indicator function $\{y f(x) < 0\}$ is discontinuous and even small
changes in f may lead to large changes in both empirical and expected risk.
Properties of such functions can be captured by the VC-dimension (Vapnik
and Chervonenkis [142]), that is, the maximum number of observations
which can be labeled in an arbitrary fashion by functions of the class.
Necessary and sufficient conditions for estimation can be stated in these
terms (Vapnik and Chervonenkis [143]). However, much tighter bounds
can be obtained by also using the scale of the class (Alon et al. [3]). In
fact, there exist function classes parameterized by a single scalar which
have infinite VC-dimension (Vapnik [140]).

Given the difficulty arising from minimizing the empirical risk, we now dis- 
cuss algorithms which minimize an upper bound on the empirical risk, while 
providing good computational properties and consistency of the estimators. 
A discussion of the statistical properties follows in Section 3.6. 

3.1. Support vector classification. Assume that S is linearly separable,
that is, there exists a linear function f(x) such that
$\operatorname{sgn}(y f(x)) = 1$ on S. In this case, the task of finding a
large margin separating hyperplane can be viewed as one of solving (Vapnik
and Lerner [145])

(51) $\underset{w, b}{\text{minimize}}\ \tfrac{1}{2}\|w\|^2$ subject to $y_i(\langle w, x_i \rangle + b) \ge 1$.

Note that $\|w\|^{-1} f(x_i)$ is the distance of the point $x_i$ to the
hyperplane $H(w, b) := \{x \mid \langle w, x \rangle + b = 0\}$. The
condition $y_i f(x_i) \ge 1$ implies that the margin of separation is at
least $2\|w\|^{-1}$. The bound becomes exact if equality is attained for
some $y_i = 1$ and $y_j = -1$. Consequently, minimizing $\|w\|$ subject to
the constraints maximizes the margin of separation. Equation (51) is a
quadratic program which can be solved efficiently (Fletcher [51]).

Mangasarian [95] devised a similar optimization scheme using $\|w\|_1$
instead of $\|w\|_2$ in the objective function of (51). The result is a
linear program. In general, one can show (Smola et al. [124]) that
minimizing the $\ell_p$ norm of w leads to the maximizing of the margin of
separation in the $\ell_q$ norm where $\frac{1}{p} + \frac{1}{q} = 1$. The
$\ell_1$ norm leads to sparse approximation schemes (see also Chen et al.
[29]), whereas the $\ell_2$ norm can be extended to Hilbert spaces and
kernels.

To deal with nonseparable problems, that is, cases when (51) is infeasible,
we need to relax the constraints of the optimization problem. Bennett and
Mangasarian [17] and Cortes and Vapnik [34] impose a linear penalty on the
violation of the large-margin constraints to obtain

(52) $\underset{w, b, \xi}{\text{minimize}}\ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i$
     subject to $y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$, $\forall i \in [n]$.

Equation (52) is a quadratic program which is always feasible (e.g.,
$w, b = 0$ and $\xi_i = 1$ satisfy the constraints). $C > 0$ is a
regularization constant trading off the violation of the constraints vs.
maximizing the overall margin.
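
For a fixed pair (w, b), the smallest slacks that satisfy the constraints of the soft-margin problem are $\xi_i = \max(0, 1 - y_i(\langle w, x_i \rangle + b))$. The sketch below evaluates the resulting objective; the function name and the toy data are hypothetical.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Objective 0.5*||w||^2 + C*sum(xi) with the smallest feasible slacks xi."""
    margins = y * (X @ w + b)
    xi = np.maximum(0.0, 1.0 - margins)
    return 0.5 * float(w @ w) + C * float(xi.sum()), xi

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -1.0], [0.5, 0.2]])
y = np.array([1.0, 1.0, -1.0, -1.0])
objective, xi = soft_margin_objective(np.array([0.5, 0.5]), -0.5, X, y, C=1.0)
```

In this example only the last point falls inside the margin, so it alone contributes a nonzero slack.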

Whenever the dimensionality of $\mathcal{X}$ exceeds n, direct optimization
of (52) is computationally inefficient. This is particularly true if we map
from $\mathcal{X}$ into an RKHS. To address these problems, one may solve
the problem in dual space as follows. The Lagrange function of (52) is given
by

(53) $\mathcal{L}(w, b, \xi, \alpha, \eta) = \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i + \sum_{i=1}^n \alpha_i \bigl(1 - \xi_i - y_i(\langle w, x_i \rangle + b)\bigr) - \sum_{i=1}^n \eta_i \xi_i$,

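where $\alpha_i, \eta_i \ge 0$ for all $i \in [n]$. To compute the dual of
$\mathcal{L}$, we need to identify the first order conditions in w, b and
$\xi$. They are given by

(54) $\partial_w \mathcal{L} = w - \sum_{i=1}^n \alpha_i y_i x_i = 0$,
     $\partial_b \mathcal{L} = -\sum_{i=1}^n \alpha_i y_i = 0$ and
     $\partial_{\xi_i} \mathcal{L} = C - \alpha_i - \eta_i = 0$.

This translates into $w = \sum_{i=1}^n \alpha_i y_i x_i$, the linear
constraint $\sum_{i=1}^n \alpha_i y_i = 0$, and the box-constraint
$\alpha_i \in [0, C]$ arising from $\eta_i \ge 0$. Substituting (54) into
$\mathcal{L}$ yields the Wolfe dual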

(55) $\underset{\alpha}{\text{minimize}}\ \tfrac{1}{2}\alpha^\top Q \alpha - \alpha^\top \mathbf{1}$ subject to $\alpha^\top y = 0$ and $\alpha_i \in [0, C]$, $\forall i \in [n]$.

$Q \in \mathbb{R}^{n \times n}$ is the matrix of inner products
$Q_{ij} := y_i y_j \langle x_i, x_j \rangle$. Clearly, this can be extended
to feature maps and kernels easily via
$Q_{ij} := y_i y_j \langle \Phi(x_i), \Phi(x_j) \rangle = y_i y_j k(x_i, x_j)$.
Note that w lies in the span of the $x_i$. This is an instance of the
representer theorem (Theorem 9). The KKT conditions (Boser et al. [23],
Cortes and Vapnik [34], Karush [81] and Kuhn and Tucker [88]) require that
at optimality $\alpha_i (y_i f(x_i) - 1 + \xi_i) = 0$. This means that only
those $x_i$ may appear in the expansion (54) for which $y_i f(x_i) \le 1$,
as otherwise $\xi_i = 0$ and hence $\alpha_i = 0$. The $x_i$ with
$\alpha_i > 0$ are commonly referred to as support vectors.
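
To make the dual concrete, the following sketch solves it on a tiny separable data set by projected gradient descent. This solver is an assumption made for illustration (practical SVM implementations typically use decomposition methods), and all names and data are hypothetical. The projection onto the set $\{\alpha : \alpha^\top y = 0,\ 0 \le \alpha_i \le C\}$ is computed by bisection on a scalar shift.

```python
import numpy as np

def project(v, y, C, iters=50):
    """Euclidean projection onto {a : a^T y = 0, 0 <= a_i <= C} via bisection."""
    bound = np.abs(v).max() + C
    lo, hi = -bound, bound
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        # y^T clip(v - tau*y) is nonincreasing in tau because y_i^2 = 1.
        if y @ np.clip(v - tau * y, 0.0, C) > 0:
            lo = tau
        else:
            hi = tau
    return np.clip(v - 0.5 * (lo + hi) * y, 0.0, C)

def svm_dual(K, y, C, steps=2000):
    """Projected gradient descent on 0.5*a^T Q a - a^T 1 with Q_ij = y_i y_j K_ij."""
    Q = (y[:, None] * y[None, :]) * K
    eta = 1.0 / np.linalg.eigvalsh(Q).max()     # step size 1 / largest eigenvalue
    alpha = np.zeros(len(y))
    for _ in range(steps):
        alpha = project(alpha - eta * (Q @ alpha - 1.0), y, C)
    return alpha

def bias(alpha, K, y, C, tol=1e-6):
    """Offset b from support vectors strictly inside the box (assumed to exist)."""
    free = (alpha > tol) & (alpha < C - tol)
    return float(np.mean(y[free] - K[free] @ (alpha * y)))

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
              [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
K = X @ X.T                                     # linear kernel
alpha = svm_dual(K, y, C=10.0)
w = X.T @ (alpha * y)                           # w = sum_i alpha_i y_i x_i
b = bias(alpha, K, y, C=10.0)
```

On this toy problem only the two points closest to the separating hyperplane should end up with nonzero coefficients, illustrating the sparsity of the support vector expansion.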

Note that $\sum_{i=1}^n \xi_i$ is an upper bound on the empirical risk, as
$y_i f(x_i) \le 0$ implies $\xi_i \ge 1$ (see also Lemma 10). The number of
misclassified points $x_i$ itself depends on the configuration of the data
and the value of C. Ben-David et al. [15] show that finding even an
approximate minimum classification error solution is difficult. That said,
it is possible to modify (52) such that a desired target number of
observations violates $y_i f(x_i) \ge \rho$ for some $\rho \in \mathbb{R}$
by making the threshold itself a variable of the optimization problem
(Scholkopf et al. [120]). This leads to the following optimization problem
($\nu$-SV classification):

(56) $\underset{w, b, \xi, \rho}{\text{minimize}}\ \tfrac{1}{2}\|w\|^2 + \sum_{i=1}^n \xi_i - n\nu\rho$
     subject to $y_i(\langle w, x_i \rangle + b) \ge \rho - \xi_i$ and $\xi_i \ge 0$.

The dual of (56) is essentially identical to (55) with the exception of an
additional constraint:

(57) $\underset{\alpha}{\text{minimize}}\ \tfrac{1}{2}\alpha^\top Q \alpha$ subject to $\alpha^\top y = 0$ and $\alpha^\top \mathbf{1} = n\nu$ and $\alpha_i \in [0, 1]$.

One can show that for every C there exists a $\nu$ such that the solution of
(57) is a multiple of the solution of (55). Scholkopf et al. [120] prove that
a solution of (57) for which $\rho > 0$ satisfies the following:

1. $\nu$ is an upper bound on the fraction of margin errors.

2. $\nu$ is a lower bound on the fraction of SVs.

Moreover, under mild conditions, with probability 1, asymptotically, $\nu$
equals both the fraction of SVs and the fraction of errors.

This statement implies that whenever the data are sufficiently well
separable (i.e., $\rho > 0$), $\nu$-SV classification finds a solution with
a fraction of at most $\nu$ margin errors. Also note that, for $\nu = 1$,
all $\alpha_i = 1$, that is, f becomes an affine copy of the Parzen windows
classifier (5).

3.2. Estimating the support of a density. We now extend the notion of
linear separation to that of estimating the support of a density (Scholkopf
et al. [117] and Tax and Duin [134]). Denote by
$X = \{x_1, \ldots, x_n\} \subseteq \mathcal{X}$ the sample drawn from
$P(x)$. Let $\mathcal{C}$ be a class of measurable subsets of $\mathcal{X}$
and let $\lambda$ be a real-valued function defined on $\mathcal{C}$. The
quantile function (Einmahl and Mason [47]) with respect to
$(P, \lambda, \mathcal{C})$ is defined as

(58) $U(\mu) = \inf\{\lambda(C) \mid P(C) \ge \mu,\ C \in \mathcal{C}\}$ where $\mu \in (0, 1]$.

We denote by $C_\lambda(\mu)$ and $C_\lambda^n(\mu)$ the (not necessarily
unique) $C \in \mathcal{C}$ that attain the infimum (when it is achievable)
on $P(x)$ and on the empirical measure given by X respectively. A common
choice of $\lambda$ is the Lebesgue measure, in which case $C_\lambda(\mu)$
is the minimum volume set $C \in \mathcal{C}$ that contains at least a
fraction $\mu$ of the probability mass.

Support estimation requires us to find some $C_\lambda^n(\mu)$ such that
$|P(C_\lambda^n(\mu)) - \mu|$ is small. This is where the complexity
trade-off enters: On the one hand, we want to use a rich class
$\mathcal{C}$ to capture all possible distributions, on the other hand,
large classes lead to large deviations between $\mu$ and
$P(C_\lambda^n(\mu))$. Therefore, we have to consider classes of sets which
are suitably restricted. This can be achieved using an SVM regularizer.

SV support estimation relates to previous work as follows: set
$\lambda(C_w) = \|w\|^2$, where $C_w = \{x \mid f_w(x) \ge \rho\}$,
$f_w(x) = \langle w, x \rangle$, and $(w, \rho)$ are respectively a weight
vector and an offset. Stated as a convex optimization problem, we want to
separate the data from the origin with maximum margin via

(59) $\underset{w, \xi, \rho}{\text{minimize}}\ \tfrac{1}{2}\|w\|^2 + \sum_{i=1}^n \xi_i - n\nu\rho$
     subject to $\langle w, x_i \rangle \ge \rho - \xi_i$ and $\xi_i \ge 0$.

Here, $\nu \in (0, 1]$ plays the same role as in (56), controlling the
number of observations $x_i$ for which $f(x_i) \le \rho$. Since nonzero
slack variables $\xi_i$ are penalized in the objective function, if w and
$\rho$ solve this problem, then the decision function f(x) will attain or
exceed $\rho$ for at least a fraction $1 - \nu$ of the $x_i$ contained in
X, while the regularization term $\|w\|$ will still be small. The dual of
(59) yields:

(60) $\underset{\alpha}{\text{minimize}}\ \tfrac{1}{2}\alpha^\top K \alpha$ subject to $\alpha^\top \mathbf{1} = \nu n$ and $\alpha_i \in [0, 1]$.

To compare (60) to a Parzen windows estimator, assume that k is such that
it can be normalized as a density in input space, such as a Gaussian. Using
$\nu = 1$ in (60), the constraints automatically imply $\alpha_i = 1$. Thus,
f reduces to a Parzen windows estimate of the underlying density. For
$\nu < 1$, the equality constraint (60) still ensures that f is a
thresholded density, now depending only on a subset of X, namely those
points which are important for deciding whether $f(x) \le \rho$.
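
The dual of the support estimation problem, minimizing $\tfrac{1}{2}\alpha^\top K \alpha$ subject to $\alpha^\top \mathbf{1} = \nu n$ and $\alpha_i \in [0, 1]$, can be explored numerically with a few lines of projected gradient descent. As before, the solver and all names are assumptions made for illustration rather than the method of the cited papers.

```python
import numpy as np

def project_capped_simplex(v, total, iters=60):
    """Projection onto {a : sum(a) = total, 0 <= a_i <= 1} via bisection."""
    lo, hi = v.min() - 1.0, v.max()
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        # sum(clip(v - tau)) is nonincreasing in tau.
        if np.clip(v - tau, 0.0, 1.0).sum() > total:
            lo = tau
        else:
            hi = tau
    return np.clip(v - 0.5 * (lo + hi), 0.0, 1.0)

def support_dual(K, nu, steps=500):
    """Projected gradient descent on 0.5*a^T K a over the feasible set."""
    n = len(K)
    eta = 1.0 / np.linalg.eigvalsh(K).max()
    alpha = np.full(n, nu)                      # feasible start: sum = nu * n
    for _ in range(steps):
        alpha = project_capped_simplex(alpha - eta * (K @ alpha), nu * n)
    return alpha
```

Setting `nu=1.0` forces every coefficient to 1, which recovers the Parzen windows case discussed above; smaller values of `nu` allow some coefficients to shrink to zero.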

3.3. Regression estimation. SV regression was first proposed in Vapnik
[140] and Vapnik et al. [144] using the so-called $\varepsilon$-insensitive
loss function. It is a direct extension of the soft-margin idea to
regression: instead of requiring that $y f(x)$ exceeds some margin value, we
now require that the values $y - f(x)$ are bounded by a margin on both
sides. That is, we impose the soft constraints

(61) $y_i - f(x_i) \le \varepsilon + \xi_i$ and $f(x_i) - y_i \le \varepsilon + \xi_i^*$,

where $\xi_i, \xi_i^* \ge 0$. If $|y_i - f(x_i)| \le \varepsilon$, no
penalty occurs. The objective function is given by the sum of the slack
variables $\xi_i, \xi_i^*$ penalized by some $C > 0$ and a measure for the
slope of the function $f(x) = \langle w, x \rangle + b$, that is,
$\tfrac{1}{2}\|w\|^2$.

Before computing the dual of this problem, let us consider a somewhat
more general situation where we use a range of different convex penalties
for the deviation between $y_i$ and $f(x_i)$. One may check that minimizing
$\tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^n (\xi_i + \xi_i^*)$ subject to (61) is
equivalent to solving

(62) $\underset{w, b}{\text{minimize}}\ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^n \psi(y_i - f(x_i))$ where $\psi(\xi) = \max(0, |\xi| - \varepsilon)$.


Choosing different loss functions $\psi$ leads to a rather rich class of
estimators:

• $\psi(\xi) = \tfrac{1}{2}\xi^2$ yields penalized least squares (LS)
regression (Hoerl and Kennard [68], Morozov [102], Tikhonov [136] and Wahba
[148]). The corresponding optimization problem can be minimized by solving a
linear system.

• For $\psi(\xi) = |\xi|$, we obtain the penalized least absolute deviations
(LAD) estimator (Bloomfield and Steiger [20]). That is, we obtain a
quadratic program to estimate the conditional median.

• A combination of LS and LAD loss yields a penalized version of Huber's
robust regression (Huber [71] and Smola and Scholkopf [126]). In this case
we have $\psi(\xi) = \frac{1}{2\sigma}\xi^2$ for $|\xi| \le \sigma$ and
$\psi(\xi) = |\xi| - \frac{\sigma}{2}$ for $|\xi| \ge \sigma$.

• Note that also quantile regression can be modified to work with kernels
(Scholkopf et al. [120]) by using as loss function the "pinball" loss, that
is, $\psi(\xi) = (\tau - 1)\xi$ if $\xi < 0$ and $\psi(\xi) = \tau\xi$ if
$\xi \ge 0$.
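
The loss functions in the list above, together with the $\varepsilon$-insensitive loss, can be written down directly. In this sketch the Huber loss is given in its standard form with threshold `sigma`, and the pinball loss uses the sign convention that keeps the loss nonnegative; the function names are ours.

```python
import numpy as np

def eps_insensitive(xi, eps):
    """max(0, |xi| - eps): no penalty inside the eps-tube."""
    return np.maximum(0.0, np.abs(xi) - eps)

def squared(xi):
    """Least squares loss."""
    return 0.5 * xi ** 2

def lad(xi):
    """Least absolute deviations loss."""
    return np.abs(xi)

def huber(xi, sigma):
    """Quadratic for |xi| <= sigma, linear beyond; continuous at |xi| = sigma."""
    return np.where(np.abs(xi) <= sigma,
                    xi ** 2 / (2.0 * sigma),
                    np.abs(xi) - sigma / 2.0)

def pinball(xi, tau):
    """Quantile loss: slope tau for xi >= 0 and slope (tau - 1) for xi < 0."""
    return np.where(xi < 0, (tau - 1.0) * xi, tau * xi)
```

Here `xi` stands for the residual $y - f(x)$, so `pinball` with `tau=0.5` is half the absolute loss and targets the conditional median.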

All the optimization problems arising from the above five cases are convex
quadratic programs. Their dual resembles that of (61), namely,

(63a) $\underset{\alpha, \alpha^*}{\text{minimize}}\ \tfrac{1}{2}(\alpha - \alpha^*)^\top K (\alpha - \alpha^*) + \varepsilon \mathbf{1}^\top (\alpha + \alpha^*) - y^\top (\alpha - \alpha^*)$

(63b) subject to $(\alpha - \alpha^*)^\top \mathbf{1} = 0$ and $\alpha_i, \alpha_i^* \in [0, C]$.

Here $K_{ij} = \langle x_i, x_j \rangle$ for linear models and
$K_{ij} = k(x_i, x_j)$ if we map $x \to \Phi(x)$. The $\nu$-trick, as
described in (56) (Scholkopf et al. [120]), can be extended to regression,
allowing one to choose the margin of approximation automatically. In this
case (63a) drops the terms in $\varepsilon$. In its place, we add a linear
constraint $(\alpha - \alpha^*)^\top \mathbf{1} = \nu n$. Likewise, LAD is
obtained from (63) by dropping the terms in $\varepsilon$ without additional
constraints. Robust regression leaves (63) unchanged, however, in the
definition of K we have an additional term of $\sigma^{-1}$ on the main
diagonal. Further details can be found in Scholkopf and Smola [118]. For
quantile regression we drop $\varepsilon$ and we obtain different constants
$C(1 - \tau)$ and $C\tau$ for the constraints on $\alpha^*$ and $\alpha$. We
will discuss uniform convergence properties of the empirical risk estimates
with respect to various $\psi(\xi)$ in Section 3.6.

3.4. Multicategory classification, ranking and ordinal regression. Many
estimation problems cannot be described by assuming that
$\mathcal{Y} = \{\pm 1\}$. In this case it is advantageous to go beyond
simple functions f(x) depending on x only. Instead, we can encode a larger
degree of information by estimating a function f(x, y) and subsequently
obtaining a prediction via
$\hat{y}(x) := \operatorname{argmax}_{y \in \mathcal{Y}} f(x, y)$. In other
words, we study problems where y is obtained as the solution of an
optimization problem over f(x, y) and we wish to find f such that
$\hat{y}$ matches $y_i$ as well as possible for relevant inputs x.

Note that the loss may be more than just a simple 0-1 loss. In the following
we denote by $\Delta(y, y')$ the loss incurred by estimating y' instead of
y. Without loss of generality, we require that $\Delta(y, y) = 0$ and that
$\Delta(y, y') \ge 0$ for all $y, y' \in \mathcal{Y}$. Key in our reasoning
is the following:

Lemma 10. Let $f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ and
assume that $\Delta(y, y') \ge 0$ with $\Delta(y, y) = 0$. Moreover, let
$\xi \ge 0$ such that $f(x, y) - f(x, y') \ge \Delta(y, y') - \xi$ for all
$y' \in \mathcal{Y}$. In this case
$\xi \ge \Delta(y, \operatorname{argmax}_{y' \in \mathcal{Y}} f(x, y'))$.

The construction of the estimator was suggested by Taskar et al. [132] and 
Tsochantaridis et al. [137], and a special instance of the above lemma is given 
by Joachims [78]. While the bound appears quite innocuous, it allows us to 
describe a much richer class of estimation problems as a convex program. 
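
The bound of Lemma 10 is easy to check numerically: compute the smallest slack that satisfies all margin constraints for a given vector of scores, and compare it with the loss of the predicted label. The helper names and the use of integer labels 0, ..., N-1 are assumptions of this sketch.

```python
def min_slack(scores, y, delta):
    """Smallest xi >= 0 with scores[y] - scores[yp] >= delta(y, yp) - xi for all yp."""
    worst = max(delta(y, yp) - scores[y] + scores[yp] for yp in range(len(scores)))
    return max(0.0, worst)

def argmax_label(scores):
    """Predicted label: the index with the largest score f(x, y')."""
    return max(range(len(scores)), key=lambda yp: scores[yp])

def zero_one(y, yp):
    return 0.0 if y == yp else 1.0

scores = [1.0, 2.5, 0.3]                 # f(x, y') for y' = 0, 1, 2
xi = min_slack(scores, 0, zero_one)      # true label 0, predicted label 1
```

In this example the prediction is wrong, so the loss is 1 while the slack equals 2.5, consistent with the lemma.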

To deal with the added complexity, we assume that f is given by
$f(x, y) = \langle \Phi(x, y), w \rangle$. Given the possibly nontrivial
connection between x and y, the use of $\Phi(x, y)$ cannot be avoided.
Corresponding kernel functions are given by
$k(x, y, x', y') = \langle \Phi(x, y), \Phi(x', y') \rangle$. We have the
following optimization problem (Tsochantaridis et al. [137]):

(64) $\underset{w, \xi}{\text{minimize}}\ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i$
     subject to $\langle w, \Phi(x_i, y_i) - \Phi(x_i, y) \rangle \ge \Delta(y_i, y) - \xi_i$, $\forall i \in [n], y \in \mathcal{Y}$.

This is a convex optimization problem which can be solved efficiently if the 
constraints can be evaluated without high computational cost. One typi- 
cally employs column-generation methods (Bennett et al. [16], Fletcher [51], 
Hettich and Kortanek [66] and Tsochantaridis et al. [137]) which identify 
one violated constraint at a time to find an approximate minimum of the 
optimization problem. 

To describe the flexibility of the framework set out by (64) we give several 
examples below: 

• Binary classification can be recovered by setting
$\Phi(x, y) = y\Phi(x)$, in which case the constraint of (64) reduces to
$2 y_i \langle \Phi(x_i), w \rangle \ge 1 - \xi_i$. Ignoring constant
offsets and a scaling factor of 2, this is exactly the standard SVM
optimization problem.

• Multicategory classification problems (Allwein et al. [2], Collins [30] and
Crammer and Singer [35]) can be encoded via $\mathcal{Y} = [N]$, where N is
the number of classes and $\Delta(y, y') = 1 - \delta_{y, y'}$. In other
words, the loss is 1 whenever we predict the wrong class and 0 for correct
classification. Corresponding kernels are typically chosen to be
$\delta_{y, y'} k(x, x')$.

• We can deal with joint labeling problems by setting
$\mathcal{Y} = \{\pm 1\}^n$. In other words, the error measure does not
depend on a single observation but on an entire set of labels. Joachims [78]
shows that the so-called $F_1$ score (van Rijsbergen [138]) used in document
retrieval and the area under the ROC curve (Bamber [10]) fall into this
category of problems. Moreover, Joachims [78] derives an $O(n^2)$ method for
evaluating the inequality constraint over $\mathcal{Y}$.

• Multilabel estimation problems deal with the situation where we want to
find the best subset of labels $y \in 2^{[N]}$ which correspond to some
observation x. Elisseeff and Weston [48] devise a ranking scheme where
$f(x, i) > f(x, j)$ if label $i \in y$ and $j \notin y$. It is a special
case of an approach described next.
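
For the multicategory case in the list above, one joint feature map that induces the kernel $\delta_{y, y'} k(x, x')$ places $\Phi(x)$ into the block of coordinates belonging to class y. The sketch below uses 0-based class indices and hypothetical function names.

```python
import numpy as np

def joint_feature(phi_x, y, num_classes):
    """Phi(x, y): copy phi_x into the y-th block of a zero vector."""
    d = len(phi_x)
    out = np.zeros(num_classes * d)
    out[y * d:(y + 1) * d] = phi_x
    return out

def score(w, phi_x, y, num_classes):
    """f(x, y) = <Phi(x, y), w>, i.e. the inner product with the y-th block of w."""
    return float(w @ joint_feature(phi_x, y, num_classes))
```

Inner products of such vectors vanish unless the two labels agree, in which case they equal the inner product of the underlying features.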

Note that (64) is invariant under translations <&(x,y) +— <&(x,y) + <I>o where 
$0 is constant, as $(xj,yj) — &(xi,y) remains unchanged. In practice, this 
means that transformations k(x,y,x' ,y') <— k(x,y,x' ,y') + {&Q,&(x,y)) + 
($0) ^{ x 'i y')) + ll^oll 2 do not affect the outcome of the estimation process. 
Since &q was arbitrary, we have the following lemma: 

Lemma 11. Let $\mathcal{H}$ be an RKHS on $\mathcal{X} \times \mathcal{Y}$ with kernel $k$. Moreover, let
$g \in \mathcal{H}$. Then the function $k(x,y,x',y') + g(x,y) + g(x',y') + \|g\|^2_{\mathcal{H}}$ is a kernel
and it yields the same estimates as $k$.
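
The invariance behind the lemma can be checked numerically. The sketch below assumes an explicit finite-dimensional feature map with random values (the arrays `Phi` and `Phi0` are made-up illustrations) and verifies that inner products between feature differences, which are all that the constraints of (64) use, do not change when a constant offset is added.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs, dim = 6, 4
Phi = rng.normal(size=(n_pairs, dim))   # rows: Phi(x, y) for a few (x, y) pairs
Phi0 = rng.normal(size=dim)             # constant offset Phi_0

def gram(F):
    """Kernel matrix of inner products between feature vectors."""
    return F @ F.T

K = gram(Phi)
K_shift = gram(Phi + Phi0)

# Shifted kernel: k + <Phi_0, Phi(x,y)> + <Phi_0, Phi(x',y')> + ||Phi_0||^2
g = Phi @ Phi0
expected = K + g[:, None] + g[None, :] + Phi0 @ Phi0

def diff_gram(K, a, b, c, d):
    """<Phi_a - Phi_b, Phi_c - Phi_d> computed from kernel entries only."""
    return K[a, c] - K[a, d] - K[b, c] + K[b, d]
```

Since the optimization problem depends on the data only through such differences, both kernels lead to the same estimate.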

We need a slight extension to deal with general ranking problems. Denote
by $\mathcal{Y} = \mathrm{Graph}[N]$ the set of all directed graphs on $N$ vertices which do not
contain loops of less than three nodes. Here an edge $(i,j) \in y$ indicates that
$i$ is preferred to $j$ with respect to the observation $x$. It is the goal to find
some function $f : \mathcal{X} \times [N] \to \mathbb{R}$ which imposes a total order on $[N]$ (for a
given $x$) by virtue of the function values $f(x,i)$ such that the total order
and $y$ are in good agreement.

More specifically, Crammer and Singer [36] and Dekel et al. [45] propose a
decomposition algorithm $A$ for the graphs $y$ such that the estimation error
is given by the number of subgraphs of $y$ which are in disagreement with
the total order imposed by $f$. As an example, multiclass classification can
be viewed as a graph $y$ where the correct label $i$ is at the root of a directed
graph and all incorrect labels are its children. Multilabel classification is
then a bipartite graph where the correct labels only contain outgoing arcs
and the incorrect labels only incoming ones.

This setting leads to a form similar to (64), except for the fact that we
now have constraints over each subgraph $G \in A(y)$. We solve

$$\begin{aligned}
\underset{w,\xi}{\text{minimize}}\quad & \frac{1}{2}\|w\|^2 + \sum_{i=1}^{n} |A(y_i)|^{-1} \sum_{G \in A(y_i)} \xi_{iG} \\
\text{subject to}\quad & \langle w, \Phi(x_i,u) - \Phi(x_i,v)\rangle \geq 1 - \xi_{iG} \text{ and } \xi_{iG} \geq 0 \\
& \text{for all } (u,v) \in G \in A(y_i).
\end{aligned} \tag{65}$$

That is, we test for all $(u,v) \in G$ whether the ranking imposed by $G \in A(y_i)$ is
satisfied.

Finally, ordinal regression problems which perform ranking not over
labels $y$ but rather over observations $x$ were studied by Herbrich et al. [65] and
Chapelle and Harchaoui [27] in the context of ordinal regression and conjoint
analysis respectively. In ordinal regression $x$ is preferred to $x'$ if $f(x) > f(x')$
and, hence, one minimizes an optimization problem akin to (64), with
constraint $\langle w, \Phi(x_i) - \Phi(x_j)\rangle \geq 1 - \xi_{ij}$. In conjoint analysis the same operation
is carried out for $\Phi(x,u)$, where $u$ is the user under consideration. Similar
models were also studied by Basilico and Hofmann [13]. Further models will
be discussed in Section 4, in particular situations where $\mathcal{Y}$ is of exponential
size. These models allow one to deal with sequences and more sophisticated
structures.
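
To make the ordinal regression constraint concrete, the sketch below computes the slack $\xi_{ij} = \max(0, 1 - \langle w, \Phi(x_i) - \Phi(x_j)\rangle)$ for a list of preference pairs. It assumes, purely for illustration, that $\Phi$ is the identity map; the function name `pairwise_hinge_slacks` is hypothetical.

```python
import numpy as np

def pairwise_hinge_slacks(w, X, prefs):
    """Slack xi_ij = max(0, 1 - <w, x_i - x_j>) for each preference (i, j),
    where (i, j) means x_i is preferred to x_j. Phi is taken as the identity."""
    w = np.asarray(w, float)
    X = np.asarray(X, float)
    return np.array([max(0.0, 1.0 - w @ (X[i] - X[j])) for i, j in prefs])
```

A pair contributes zero slack once the preferred observation scores at least one unit higher than the other.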

3.5. Applications of SVM algorithms. When SVMs were first presented, 
they initially met with skepticism in the statistical community. Part of the 
reason was that, as described, SVMs construct their decision rules in poten- 
tially very high-dimensional feature spaces associated with kernels. Although 
there was a fair amount of theoretical work addressing this issue (see
Section 3.6 below), it was probably to a larger extent the empirical success of
SVMs that paved their way to becoming a standard method in the statistical
toolbox. The first successes of SVMs on practical problems were in
handwritten digit recognition, which was the main benchmark task considered in the
Adaptive Systems Department at AT&T Bell Labs where SVMs were
developed. Using methods to incorporate transformation invariances, SVMs were
shown to beat the world record on the MNIST benchmark set, at the time
the gold standard in the field (DeCoste and Schölkopf [44]). There has been
a significant number of further computer vision applications of SVMs since 
then, including tasks such as object recognition and detection. Nevertheless, 
it is probably fair to say that two other fields have been more influential in 
spreading the use of SVMs: bioinformatics and natural language processing. 
Both of them have generated a spectrum of challenging high-dimensional 
problems on which SVMs excel, such as microarray processing tasks and 
text categorization. For references, see Joachims [77] and Schölkopf et al.
[121].

Many successful applications have been implemented using SV classifiers;
however, the other variants of SVMs have also led to very good results,
including SV regression, SV novelty detection, SVMs for ranking and, more
recently, problems with interdependent labels (McCallum et al. [96] and
Tsochantaridis et al. [137]).

At present there exists a large number of readily available software
packages for SVM optimization. For instance, SVMStruct, based on
Tsochantaridis et al. [137], solves structured estimation problems. LibSVM is an
open source solver which excels on binary problems. The Torch package
contains a number of estimation methods, including SVM solvers. Several
SVM implementations are also available via statistical packages, such as R.

3.6. Margins and uniform convergence bounds. While the algorithms 
were motivated by means of their practicality and the fact that 0-1 loss 
functions yield hard-to-control estimators, there exists a large body of work 
on statistical analysis. We refer to the works of Bartlett and Mendelson [12], 
Jordan et al. [80], Koltchinskii [86], Mendelson [98] and Vapnik [141] for 
details. In particular, the review of Bousquet et al. [24] provides an excel- 
lent summary of the current state of the art. Specifically for the structured 
case, recent work by Collins [30] and Taskar et al. [132] deals with explicit 
constructions to obtain better scaling behavior in terms of the number of 
class labels. 

The general strategy of the analysis can be described by the following
three steps: first, the discrete loss is upper bounded by some function, such
as $\psi(yf(x))$, which can be efficiently minimized [e.g., the soft margin function
$\max(0, 1 - yf(x))$ of the previous section satisfies this property]. Second, one
proves that the empirical average of the $\psi$-loss is concentrated close to its
expectation. This will be achieved by means of Rademacher averages. Third,
one shows that under rather general conditions the minimization of the
$\psi$-loss is consistent with the minimization of the expected risk. Finally, these
bounds are combined to obtain rates of convergence which only depend on
the Rademacher average and the approximation properties of the function
class under consideration.
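
The first step of this strategy can be illustrated with a few lines of Python. The sketch evaluates the 0-1 loss and the soft margin function $\max(0, 1 - yf(x))$ on a grid of margin values $yf(x)$; the grid and function names are illustrative assumptions, and a zero margin is counted as an error.

```python
import numpy as np

def zero_one(margin):
    """Discrete loss as a function of the margin y f(x)."""
    return (margin <= 0).astype(float)

def hinge(margin):
    """Soft margin function max(0, 1 - y f(x))."""
    return np.maximum(0.0, 1.0 - margin)

margins = np.linspace(-3.0, 3.0, 121)
upper_bound_holds = bool(np.all(hinge(margins) >= zero_one(margins)))
```

Because the hinge function is convex and dominates the discrete loss everywhere, minimizing its empirical average is tractable while still controlling the number of errors.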

4. Statistical models and RKHS. As we have argued so far, the repro- 
ducing kernel Hilbert space approach offers many advantages in machine 
learning: (i) powerful and flexible models can be defined, (ii) many results 
and algorithms for linear models in Euclidean spaces can be generalized 
to RKHS, (iii) learning theory assures that effective learning in RKHS is 
possible, for instance, by means of regularization. 

In this section we will show how kernel methods can be utilized in the
context of statistical models. There are several reasons to pursue such an
avenue. First of all, in conditional modeling, it is often insufficient to compute a
prediction without assessing confidence and reliability. Second, when dealing 
with multiple or structured responses, it is important to model dependen- 
cies between responses in addition to the dependence on a set of covariates. 
Third, incomplete data, be it due to missing variables, incomplete training 
targets or a model structure involving latent variables, needs to be dealt 
with in a principled manner. All of these issues can be addressed by using 
the RKHS approach to define statistical models and by combining kernels 
with statistical approaches such as exponential models, generalized linear 
models and Markov networks.

4.1. Exponential RKHS models.

4.1.1. Exponential models. Exponential models or exponential families
are among the most important classes of parametric models studied in
statistics. Given a canonical vector of statistics $\Phi$ and a $\sigma$-finite measure $\nu$ over
the sample space $\mathcal{X}$, an exponential model can be defined via its probability
density with respect to $\nu$ (cf. Barndorff-Nielsen [11]),

$$p(x;\theta) = \exp[\langle \theta, \Phi(x)\rangle - g(\theta)], \quad \text{where } g(\theta) := \ln \int_{\mathcal{X}} e^{\langle \theta, \Phi(x)\rangle}\, d\nu(x). \tag{66}$$

The $m$-dimensional vector $\theta \in \Theta$ with $\Theta := \{\theta \in \mathbb{R}^m : g(\theta) < \infty\}$ is also called
the canonical parameter vector. In general, there are multiple exponential
representations of the same model via canonical parameters that are affinely
related to one another (Murray and Rice [103]). A representation with
minimal $m$ is called a minimal representation, in which case $m$ is the order of
the exponential model. One of the most important properties of exponential
families is that they have sufficient statistics of fixed dimensionality, that is,
the joint density for i.i.d. random variables $X_1, X_2, \ldots, X_n$ is also
exponential, the corresponding canonical statistics simply being $\sum_{i=1}^{n} \Phi(X_i)$. It is
well known that much of the structure of exponential models can be derived
from the log partition function $g(\theta)$, in particular,

$$\nabla_\theta g(\theta) = \mu(\theta) := \mathbf{E}_\theta[\Phi(X)], \qquad \partial^2_\theta g(\theta) = \mathbf{V}_\theta[\Phi(X)], \tag{67}$$

where $\mu$ is known as the mean-value map. Being a covariance matrix, the
Hessian of $g$ is positive semi-definite and, consequently, $g$ is convex.

Maximum likelihood estimation (MLE) in exponential families leads to
a particularly elegant form for the MLE equations: the expected and the
observed canonical statistics agree at the MLE $\hat\theta$. This means, given an
i.i.d. sample $S = (x_i)_{i \in [n]}$,

$$\mathbf{E}_{\hat\theta}[\Phi(X)] = \mu(\hat\theta) = \frac{1}{n}\sum_{i=1}^{n} \Phi(x_i) =: \mathbf{E}_S[\Phi(X)]. \tag{68}$$
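
The identity $\nabla_\theta g(\theta) = \mu(\theta)$ in (67), which underlies the moment-matching form of the MLE equations, can be checked numerically on a small finite sample space. The sketch below assumes the counting measure on five points with randomly drawn statistics; all names and values are illustrative assumptions.

```python
import numpy as np

def log_partition(theta, Phi):
    """g(theta) = ln sum_x exp<theta, Phi(x)> (counting measure, finite space)."""
    s = Phi @ theta
    m = s.max()
    return m + np.log(np.exp(s - m).sum())

def mean_map(theta, Phi):
    """mu(theta) = E_theta[Phi(X)] under p(x; theta) = exp[<theta,Phi(x)> - g]."""
    p = np.exp(Phi @ theta - log_partition(theta, Phi))
    return p @ Phi

def numeric_grad(f, theta, h=1e-6):
    """Central finite-difference gradient, used only as a check."""
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        e = np.zeros_like(theta)
        e[j] = h
        grad[j] = (f(theta + e) - f(theta - e)) / (2 * h)
    return grad

rng = np.random.default_rng(1)
Phi = rng.normal(size=(5, 3))    # 5 sample points, 3 canonical statistics
theta = rng.normal(size=3)
```

At a maximum likelihood solution the same quantity `mean_map(theta, Phi)` would equal the sample average of the statistics, as stated in (68).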

4.1.2. Exponential RKHS models. One can extend the parametric
exponential model in (66) by defining a statistical model via an RKHS $\mathcal{H}$
with generating kernel $k$. Linear functions $\langle \theta, \Phi(\cdot)\rangle$ over $\mathcal{X}$ are replaced with
functions $f \in \mathcal{H}$, which yields an exponential RKHS model

$$p(x;f) = \exp[f(x) - g(f)], \qquad f \in \mathcal{H} := \Big\{ f : f(\cdot) = \sum_{x \in S} \alpha_x k(\cdot, x),\ S \subseteq \mathcal{X},\ |S| < \infty \Big\}. \tag{69}$$

A justification for using exponential RKHS families with rich canonical
statistics as a generic way to define nonparametric models stems from the
fact that if the chosen kernel $k$ is powerful enough, the associated exponential
families become universal density estimators. This can be made precise using
the concept of universal kernels (Steinwart [128], cf. Section 2).

Proposition 12 (Dense densities). Let $\mathcal{X}$ be a measurable set with a
fixed $\sigma$-finite measure $\nu$ and denote by $\mathcal{P}$ a family of densities on $\mathcal{X}$ with
respect to $\nu$ such that $p \in \mathcal{P}$ is uniformly bounded from above and continuous.
Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a universal kernel for $\mathcal{H}$. Then the exponential RKHS
family of densities generated by $k$ according to equation (69) is dense in $\mathcal{P}$
in the $L_\infty$ sense.

4.1.3. Conditional exponential models. For the rest of the paper we will
focus on the case of predictive or conditional modeling with a (potentially
compound or structured) response variable $Y$ and predictor variables $X$.
Taking up the concept of joint kernels introduced in the previous section, we
will investigate conditional models that are defined by functions
$f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ from some RKHS $\mathcal{H}$ over $\mathcal{X} \times \mathcal{Y}$ with kernel $k$ as follows:

$$p(y|x;f) = \exp[f(x,y) - g(x,f)], \quad \text{where } g(x,f) := \ln \int_{\mathcal{Y}} e^{f(x,y)}\, d\nu(y). \tag{70}$$

Notice that in the finite-dimensional case we have a feature map
$\Phi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^m$ from which parametric models are obtained via
$\mathcal{H} := \{f : \exists w,\ f(x,y) = f(x,y;w) := \langle w, \Phi(x,y)\rangle\}$ and each $f$ can be identified with its parameter
$w$. Let us discuss some concrete examples to illustrate the rather general
model equation (70):

• Let $Y$ be univariate and define $\Phi(x,y) = y\Phi(x)$. Then simply $f(x,y;w) =
\langle w, \Phi(x,y)\rangle = y f(x;w)$, with $f(x;w) := \langle w, \Phi(x)\rangle$, and the model equation
in (70) reduces to

$$p(y|x;w) = \exp[y\langle w, \Phi(x)\rangle - g(x,w)]. \tag{71}$$

This is a generalized linear model (GLM) (McCullagh and Nelder [97])
with a canonical link, that is, the canonical parameters depend linearly
on the covariates $\Phi(x)$. For different response scales, we get several
well-known models such as, for instance, logistic regression where $y \in \{-1, 1\}$.

• In the nonparametric extension of generalized linear models following
Green and Yandell [57] and O'Sullivan [105] the parametric assumption
on the linear predictor $f(x;w) = \langle w, \Phi(x)\rangle$ in the GLMs is relaxed by
requiring that $f$ comes from some sufficiently smooth class of functions,
namely, an RKHS defined over $\mathcal{X}$. In combination with a parametric part,
this can also be used to define semi-parametric models. Popular choices of
kernels include the ANOVA kernel investigated by [149]. This is a special
case of defining joint kernels from an existing kernel $k$ over inputs via
$k((x,y),(x',y')) := yy'k(x,x')$.

• Joint kernels provide a powerful framework for prediction problems with
structured outputs. An illuminating example is statistical natural
language parsing with lexicalized probabilistic context free grammars
(Magerman [94]). Here $x$ will be an English sentence and $y$ a parse tree for $x$,
that is, a highly structured and complex output. The productions of the
grammar are known, but the conditional probability $p(y|x)$ needs to be
estimated based on training data of parsed/annotated sentences. In the
simplest case, the extracted statistics $\Phi$ may encode the frequencies of the
use of different productions in a sentence with a known parse tree. More
sophisticated feature encodings are discussed in Taskar et al. [133] and
Zettlemoyer and Collins [156]. The conditional modeling approach
provides an alternative to state-of-the-art approaches that estimate joint models
$p(x,y)$ with maximum likelihood or maximum entropy and obtain
predictive models by conditioning on $x$.
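
For a finite output space the conditional model (70) is a normalized exponential of the scores $f(x,y)$. The sketch below computes these probabilities and then specializes to $\Phi(x,y) = y\Phi(x)$ with $y \in \{-1, +1\}$, which gives logistic regression. The function names are hypothetical and the counting measure on $\mathcal{Y}$ is an assumption.

```python
import numpy as np

def conditional_probs(scores):
    """p(y|x; f) = exp[f(x, y) - g(x, f)] for scores f(x, y) over a finite Y."""
    scores = np.asarray(scores, float)
    m = scores.max()
    g = m + np.log(np.exp(scores - m).sum())   # log-partition g(x, f)
    return np.exp(scores - g)

def binary_glm_prob(w, phi_x, y):
    """Special case Phi(x, y) = y * Phi(x) with y in {-1, +1}."""
    labels = np.array([-1.0, 1.0])
    p = conditional_probs(labels * float(np.dot(w, phi_x)))
    return p[0] if y == -1 else p[1]
```

In the binary case the probability of $y = +1$ equals the logistic function evaluated at $2\langle w, \Phi(x)\rangle$, the factor 2 arising from the symmetric label coding.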

4.1.4. Risk functions for model fitting. There are different inference
principles to determine the optimal function $f \in \mathcal{H}$ for the conditional
exponential model in (70). One standard approach to parametric model fitting is to
maximize the conditional log-likelihood or, equivalently, minimize a
logarithmic loss, a strategy pursued in the Conditional Random Field (CRF)
approach of Lafferty et al. [90]. Here we consider the more general case of
minimizing a functional that includes a monotone function of the Hilbert space norm
$\|f\|_{\mathcal{H}}$ as a stabilizer (Wahba [148]). This reduces to penalized log-likelihood
estimation in the finite-dimensional case,

$$C^{\mathrm{ll}}(f;S) := -\frac{1}{n}\sum_{i=1}^{n} \ln p(y_i|x_i;f), \qquad \hat f^{\mathrm{ll}}(S) := \operatorname*{argmin}_{f \in \mathcal{H}} \frac{\lambda}{2}\|f\|^2_{\mathcal{H}} + C^{\mathrm{ll}}(f;S). \tag{72}$$

• For the parametric case, Lafferty et al. [90] have employed variants of
improved iterative scaling (Darroch and Ratcliff [40] and Della Pietra
[46]) to optimize equation (72), whereas Sha and Pereira [122] have
investigated preconditioned conjugate gradient descent and limited memory
quasi-Newton methods.

• In order to optimize equation (72) one usually needs to compute
expectations of the canonical statistics $\mathbf{E}_f[\Phi(Y,x)]$ at sample points $x = x_i$, which
requires the availability of efficient inference algorithms.

As we have seen in the case of classification and regression,
likelihood-based criteria are by no means the only justifiable choice and large margin
methods offer an interesting alternative. To that end, we will present
a general formulation of large margin methods for response variables over
finite sample spaces that is based on the approach suggested by Altun et al.
[6] and Taskar et al. [132]. Define

$$r(x,y;f) := f(x,y) - \max_{y' \neq y} f(x,y') = \min_{y' \neq y} \log \frac{p(y|x;f)}{p(y'|x;f)} \quad \text{and} \quad r(S;f) := \min_{i=1}^{n} r(x_i,y_i;f). \tag{73}$$

Here $r(S;f)$ generalizes the notion of separation margin used in SVMs.
Since the log-odds ratio is sensitive to rescaling of $f$, that is, $r(x,y;\beta f) =
\beta r(x,y;f)$, we need to constrain $\|f\|_{\mathcal{H}}$ to make the problem well defined. We
thus replace $f$ by $\phi^{-1} f$ for some fixed dispersion parameter $\phi > 0$ and define
the maximum margin problem $\hat f^{\mathrm{mm}}(S) := \phi^{-1} \operatorname{argmax}_{\|f\|_{\mathcal{H}} = 1} r(S; f/\phi)$. For
the sake of the presentation, we will drop $\phi$ in the following. (We will not
deal with the problem of how to estimate $\phi$ here; note, however, that one
does need to know $\phi$ in order to make an optimal deterministic prediction.)
Using the same line of arguments as was used in Section 3, the maximum
margin problem can be re-formulated as a constrained optimization problem



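$$\hat f^{\mathrm{mm}}(S) := \operatorname*{argmin}_{f \in \mathcal{H}} \frac{1}{2}\|f\|^2_{\mathcal{H}} \quad \text{s.t. } r(x_i,y_i;f) \geq 1,\ \forall i \in [n], \tag{74}$$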



provided the latter is feasible, that is, if there exists $f \in \mathcal{H}$ such that $r(S;f) >
0$. To make the connection to SVMs, consider the case of binary
classification with $\Phi(x,y) = y\Phi(x)$, $f(x,y;w) = \langle w, y\Phi(x)\rangle$, where $r(x,y;f) =
\langle w, y\Phi(x)\rangle - \langle w, -y\Phi(x)\rangle = 2y\langle w, \Phi(x)\rangle = 2\rho(x,y;w)$. The latter is twice the
standard margin for binary classification in SVMs.
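
As a small illustration of the margin $r(x,y;f)$, the sketch below computes it from a vector of scores over a finite label set; the function name `margin` and the integer label indexing are assumptions made for the example.

```python
import numpy as np

def margin(scores, y):
    """r(x, y; f) = f(x, y) - max_{y' != y} f(x, y') for a score vector over Y."""
    scores = np.asarray(scores, float)
    others = np.delete(scores, y)   # scores of all competing labels
    return float(scores[y] - others.max())
```

With two labels scored as $-\langle w, \Phi(x)\rangle$ and $+\langle w, \Phi(x)\rangle$, the margin of the positive label is $2\langle w, \Phi(x)\rangle$, in line with the binary case discussed above.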

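A soft margin version can be defined based on the hinge loss as follows:

$$C^{\mathrm{hl}}(f;S) := \frac{1}{n}\sum_{i=1}^{n} \max\{1 - r(x_i,y_i;f), 0\}, \qquad \hat f^{\mathrm{sm}}(S) := \operatorname*{argmin}_{f \in \mathcal{H}} \frac{\lambda}{2}\|f\|^2_{\mathcal{H}} + C^{\mathrm{hl}}(f;S). \tag{75}$$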

• An equivalent formulation using slack variables $\xi_i$ as discussed in Section 3
can be obtained by introducing soft-margin constraints $r(x_i,y_i;f) \geq 1 -
\xi_i$, $\xi_i \geq 0$ and by defining $C^{\mathrm{hl}} = \frac{1}{n}\sum_i \xi_i$. Each nonlinear constraint can be further
expanded into $|\mathcal{Y}|$ linear constraints $f(x_i,y_i) - f(x_i,y) \geq 1 - \xi_i$ for all
$y \neq y_i$.

• Prediction problems with structured outputs often involve task-specific
loss functions $\Delta : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ as discussed in Section 3.4. As suggested in
Taskar et al. [132], cost sensitive large margin methods can be obtained by
defining re-scaled margin constraints $f(x_i,y_i) - f(x_i,y) \geq \Delta(y_i,y) - \xi_i$.

• Another sensible option in the parametric case is to minimize an
exponential risk function of the following type:

$$\hat f^{\mathrm{exp}}(S) := \operatorname*{argmin}_{w} \frac{1}{n}\sum_{i=1}^{n} \sum_{y \neq y_i} \exp[f(x_i,y;w) - f(x_i,y_i;w)]. \tag{76}$$

This is related to the exponential loss used in the AdaBoost method of
Freund and Schapire [53]. Since we are mainly interested in kernel-based
methods here, we refrain from further elaborating on this connection.
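
The soft margin risk and its cost sensitive variant can be sketched in a few lines. The function below assumes a finite label set indexed by integers and a score matrix `F` with `F[i, c]` holding $f(x_i, c)$; both the name `soft_margin_risk` and this layout are illustrative assumptions. With the default 0-1 loss it reproduces $C^{\mathrm{hl}}$ from (75); with a general `delta` it uses the re-scaled margin constraints.

```python
import numpy as np

def soft_margin_risk(F, y, delta=None):
    """(1/n) sum_i max_{y' != y_i} max(0, Delta(y_i, y') - [f(x_i,y_i) - f(x_i,y')]).
    F[i, c] holds f(x_i, c); delta[c, c'] is the task loss (default: 0-1 loss)."""
    F = np.asarray(F, float)
    n, m = F.shape
    if delta is None:
        delta = 1.0 - np.eye(m)
    total = 0.0
    for i in range(n):
        viol = delta[y[i]] - (F[i, y[i]] - F[i])   # one entry per label y'
        viol[y[i]] = 0.0                           # exclude y' = y_i
        total += max(0.0, viol.max())
    return total / n
```

Each term is the smallest slack $\xi_i$ that satisfies all linear constraints for example $i$, so the returned value is the slack-based objective evaluated at a fixed $f$.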

4.1.5. Generalized representer theorem and dual soft-margin formulation.
It is crucial to understand how the representer theorem applies in the
setting of arbitrary discrete output spaces, since a finite representation for the
optimal $\hat f \in \{\hat f^{\mathrm{ll}}, \hat f^{\mathrm{sm}}\}$ is the basis for constructive model fitting. Notice that
the regularized log-loss, as well as the soft margin functional introduced
above, depends not only on the values of $f$ on the sample $S$, but rather on
the evaluation of $f$ on the augmented sample $\tilde S := \{(x_i,y) : i \in [n], y \in \mathcal{Y}\}$.
This is the case, because for each $x_i$, output values $y \neq y_i$ not observed with
$x_i$ show up in the log-partition function $g(x_i,f)$ in (70), as well as in the
log-odds ratios in (73). This adds an additional complication compared to
binary classification.

Corollary 13. Denote by $\mathcal{H}$ an RKHS on $\mathcal{X} \times \mathcal{Y}$ with kernel $k$ and
let $S = ((x_i,y_i))_{i \in [n]}$. Furthermore, let $C(f;S)$ be a functional depending
on $f$ only via its values on the augmented sample $\tilde S$. Let $\Omega$ be a strictly
monotonically increasing function. Then the solution of the optimization
problem $\hat f(S) := \operatorname{argmin}_{f \in \mathcal{H}} C(f;S) + \Omega(\|f\|_{\mathcal{H}})$ can be written as

$$\hat f(\cdot) = \sum_{i=1}^{n} \sum_{y \in \mathcal{Y}} \beta_{iy}\, k(\cdot, (x_i,y)). \tag{77}$$

This follows directly from Theorem 9.

Let us focus on the soft margin maximizer $\hat f^{\mathrm{sm}}$. Instead of solving (75)
directly, we first derive the dual program, following essentially the derivation
in Section 3.

Proposition 14 (Tsochantaridis et al. [137]). The minimizer $\hat f^{\mathrm{sm}}(S)$
can be written as in Corollary 13, where the expansion coefficients can be
computed from the solution of the following convex quadratic program:

$$\min_{\alpha} \Big\{ \frac{1}{2} \sum_{i,j=1}^{n} \sum_{y \neq y_i} \sum_{y' \neq y_j} \alpha_{iy}\alpha_{jy'} K_{iy,jy'} - \sum_{i=1}^{n} \sum_{y \neq y_i} \alpha_{iy} \Big\} \tag{78a}$$

$$\text{s.t. } \lambda n \sum_{y \neq y_i} \alpha_{iy} \leq 1,\ \forall i \in [n]; \qquad \alpha_{iy} \geq 0,\ \forall i \in [n],\ y \in \mathcal{Y}, \tag{78b}$$

where $K_{iy,jy'} := k((x_i,y_i),(x_j,y_j)) + k((x_i,y),(x_j,y')) - k((x_i,y_i),(x_j,y')) -
k((x_i,y),(x_j,y_j))$.

• The multiclass SVM formulation of [35] can be recovered as a special
case for kernels that are diagonal with respect to the outputs, that is,
$k((x,y),(x',y')) = \delta_{y,y'} k(x,x')$. Notice that in this case the quadratic part
in equation (78a) simplifies to

$$\sum_{i,j} k(x_i,x_j) \sum_{y} \alpha_{iy}\alpha_{jy} \big[1 + \delta_{y_i,y}\delta_{y_j,y} - \delta_{y_i,y} - \delta_{y_j,y}\big].$$

• The pairs $(x_i,y)$ for which $\alpha_{iy} > 0$ are the support pairs, generalizing
the notion of support vectors. As in binary SVMs, their number can be
much smaller than the total number of constraints. Notice also that in the
final expansion contributions $k(\cdot,(x_i,y_i))$ will get nonnegative weights,
whereas $k(\cdot,(x_i,y))$ for $y \neq y_i$ will get nonpositive weights. Overall one
gets a balance equation $\beta_{iy_i} + \sum_{y \neq y_i} \beta_{iy} = 0$ for every data point.
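
The entries $K_{iy,jy'}$ of the dual program are inner products between feature differences $\Phi(x_i,y_i) - \Phi(x_i,y)$ and $\Phi(x_j,y_j) - \Phi(x_j,y')$, and can be computed from joint kernel evaluations alone. The sketch below assumes a joint kernel with signature `k(x, y, x', y')`; the helper names and the output-diagonal kernel with a linear input kernel are illustrative assumptions.

```python
import numpy as np

def dual_kernel_entry(k, X, Y, i, y, j, yp):
    """K_{iy,jy'} = <Phi(x_i,y_i) - Phi(x_i,y), Phi(x_j,y_j) - Phi(x_j,y')>."""
    return (k(X[i], Y[i], X[j], Y[j]) + k(X[i], y, X[j], yp)
            - k(X[i], Y[i], X[j], yp) - k(X[i], y, X[j], Y[j]))

def diag_joint_kernel(x, y, xp, yp):
    """delta_{y,y'} <x, x'>: output-diagonal joint kernel, linear input kernel."""
    return float(np.dot(x, xp)) if y == yp else 0.0
```

Only pairs $(i,y)$ with $y \neq y_i$ enter the quadratic program, so in practice the matrix is assembled over these index pairs.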

4.1.6. Sparse approximation. Proposition 14 shows that sparseness in
the representation of $\hat f^{\mathrm{sm}}$ is linked to the fact that only few $\alpha_{iy}$ in the
solution to the dual problem in equation (78) are nonzero. Note that each of
these Lagrange multipliers is linked to the corresponding soft margin
constraint $f(x_i,y_i) - f(x_i,y) \geq 1 - \xi_i$. Hence, sparseness is achieved if only few
constraints are active at the optimal solution. While this may or may not be
the case for a given sample, one can still exploit this observation to define
a nested sequence of relaxations, where margin constraints are incrementally
added. This corresponds to a constraint selection algorithm (Bertsimas and
Tsitsiklis [19]) for the primal or, equivalently, a variable selection or
column generation method for the dual program and has been investigated in
Tsochantaridis et al. [137]. Solving a sequence of increasingly tighter
relaxations to a mathematical problem is also known as an outer approximation.
In particular, one may iterate through the training examples according to
some (fair) visitation schedule and greedily select constraints that are most
violated at the current solution $f$, that is, for the $i$th instance one computes

$$\hat y_i = \operatorname*{argmax}_{y \neq y_i} f(x_i,y) = \operatorname*{argmax}_{y \neq y_i} p(y|x_i;f), \tag{79}$$

and then strengthens the current relaxation by including $\alpha_{i\hat y_i}$ in the
optimization of the dual if $f(x_i,y_i) - f(x_i,\hat y_i) < 1 - \xi_i - \epsilon$. Here $\epsilon > 0$ is a
pre-defined tolerance parameter. It is important to understand how many
strengthening steps are necessary to achieve a reasonably close
approximation to the original problem. The following theorem provides an answer:

Theorem 15 (Tsochantaridis et al. [137]). Let $R = \max_{i,y} K_{iy,iy}$ and
choose $\epsilon > 0$. A sequential strengthening procedure, which optimizes
equation (75) by greedily selecting $\epsilon$-violated constraints, will find an
approximate solution where all constraints are fulfilled within a precision of $\epsilon$, that
is, $r(x_i,y_i;f) \geq 1 - \xi_i - \epsilon$ after at most $\frac{2n}{\epsilon} \cdot \max\{1, \frac{4R^2}{\lambda n^2 \epsilon}\}$ steps.

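Corollary 16. Denote by $(\hat f, \hat\xi)$ the optimal solution of a relaxation
of the problem in Proposition 14, minimizing $\mathcal{R}(f,\xi,S)$ while violating no
constraint by more than $\epsilon$ (cf. Theorem 15). Then

$$\mathcal{R}(\hat f, \hat\xi, S) \leq \mathcal{R}(\hat f^{\mathrm{sm}}, \xi^*, S) \leq \mathcal{R}(\hat f, \hat\xi, S) + \epsilon,$$

where $(\hat f^{\mathrm{sm}}, \xi^*)$ is the optimal solution of the original problem.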

• Combined with an efficient QP solver, the above theorem guarantees a
runtime polynomial in $n$, $\epsilon^{-1}$, $R$ and $\lambda^{-1}$. This holds irrespective of
special properties of the data set utilized; the dependency on the sample
points $x_i$ enters only through the radius $R$.

• The remaining key problem is how to compute equation (79) efficiently.
The answer depends on the specific form of the joint kernel $k$ and/or
the feature map $\Phi$. In many cases, efficient dynamic programming
techniques exist, whereas in other cases one has to resort to approximations
or use other methods to identify a set of candidate distractors $\mathcal{Y}_i \subset \mathcal{Y}$ for
a training pair $(x_i,y_i)$ (Collins [30]). Sometimes one may also have search
heuristics available that may not find the solution to equation (79), but
that find (other) $\epsilon$-violating constraints with a reasonable computational
effort.
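
The selection step of the strengthening procedure can be sketched for a small label set where equation (79) is solved by enumeration. The code below performs one pass over the sample and returns the constraints that would be added to the working set; re-solving the dual after each addition is deliberately left out. The function names and the score matrix layout (`F[i][c]` holding $f(x_i,c)$) are assumptions made for the illustration.

```python
import numpy as np

def most_violated(F_i, y_i):
    """Eq. (79) by enumeration: the highest-scoring label other than y_i."""
    scores = np.array(F_i, float)   # copy, so the caller's scores are untouched
    scores[y_i] = -np.inf
    return int(np.argmax(scores))

def select_constraints(F, y, xi, eps):
    """One pass over the sample: return pairs (i, y_hat) whose constraint
    f(x_i,y_i) - f(x_i,y_hat) >= 1 - xi_i is violated by more than eps."""
    working = []
    for i in range(len(y)):
        y_hat = most_violated(F[i], y[i])
        if F[i][y[i]] - F[i][y_hat] < 1.0 - xi[i] - eps:
            working.append((i, y_hat))
    return working
```

For structured outputs the enumeration in `most_violated` would be replaced by a problem-specific search such as dynamic programming.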

4.1.7. Generalized Gaussian processes classification. The model
equation (70) and the minimization of the regularized log-loss can be
interpreted as a generalization of Gaussian process classification (Altun et al.
[4] and Rasmussen and Williams [111]) by assuming that $(f(x,\cdot))_{x \in \mathcal{X}}$ is a
vector-valued zero mean Gaussian process; note that the covariance function
$C$ is defined over pairs $\mathcal{X} \times \mathcal{Y}$. For a given sample $S$, define a multi-index
vector $F(S) := (f(x_i,y))_{i,y}$ as the restriction of the stochastic process $f$ to
the augmented sample $\tilde S$. Denote the kernel matrix by $K = (K_{iy,jy'})$, where
$K_{iy,jy'} := C((x_i,y),(x_j,y'))$ with indices $i,j \in [n]$ and $y,y' \in \mathcal{Y}$, so that, in
summary, $F(S) \sim \mathcal{N}(0,K)$. This induces a predictive model via Bayesian
model integration according to

$$p(y|x;S) = \int p(y|F(x,\cdot))\, p(F|S)\, dF, \tag{80}$$

where $x$ is a test point that has been included in the sample (transductive
setting). For an i.i.d. sample, the log-posterior for $F$ can be written as

$$\ln p(F|S) = -\frac{1}{2} F^T K^{-1} F + \sum_{i=1}^{n} [f(x_i,y_i) - g(x_i,F)] + \text{const}. \tag{81}$$

Invoking the representer theorem for $\hat F(S) := \operatorname{argmax}_F \ln p(F|S)$, we know
that

$$\hat F(S)_{iy} = \sum_{j=1}^{n} \sum_{y' \in \mathcal{Y}} \alpha_{jy'} K_{iy,jy'}, \tag{82}$$

which we plug into equation (81) to arrive at

$$\min_{\alpha}\ \frac{1}{2}\alpha^T K \alpha - \sum_{i=1}^{n} \Big( \alpha^T K e_{iy_i} - \log \sum_{y \in \mathcal{Y}} \exp[\alpha^T K e_{iy}] \Big), \tag{83}$$

where $e_{iy}$ denotes the respective unit vector. Notice that for $f(\cdot) = \sum_{i,y} \alpha_{iy} k(\cdot,
(x_i,y))$ the first term is, up to the factor $\frac{1}{2}$, the squared RKHS norm of $f \in \mathcal{H}$ since
$\langle f,f\rangle_{\mathcal{H}} = \sum_{i,j} \sum_{y,y'} \alpha_{iy}\alpha_{jy'} \langle k(\cdot,(x_i,y)), k(\cdot,(x_j,y'))\rangle$. The latter inner
product reduces to $k((x_i,y),(x_j,y'))$ due to the reproducing property. Again, the
key issue in solving (83) is how to achieve sparseness in the expansion for $F$.
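
As a hedged sketch, the negative log-posterior obtained by substituting $F = K\alpha$ from (82) into (81) can be evaluated as follows. The flattening of the multi-index $(i,y)$ into position `i * m + y`, with `m` the number of labels, and the function name are assumptions of this example.

```python
import numpy as np

def neg_log_posterior(alpha, K, y, n, m):
    """0.5 a^T K a - sum_i ( (K a)_{i,y_i} - log sum_y exp (K a)_{i,y} ),
    with the multi-index (i, y) stored at position i * m + y."""
    F = (K @ alpha).reshape(n, m)                  # F[i, c] = f(x_i, c)
    mx = F.max(axis=1, keepdims=True)
    log_z = (mx + np.log(np.exp(F - mx).sum(axis=1, keepdims=True))).ravel()
    return 0.5 * alpha @ K @ alpha - np.sum(F[np.arange(n), y] - log_z)
```

At $\alpha = 0$ all labels are equally likely, so the objective equals $n \log m$; any gradient-based optimizer could be applied to this function, which is convex in $\alpha$ for a positive semi-definite $K$.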



4.2. Markov networks and kernels. In Section 4.1 no assumptions about
the specific structure of the joint kernel defining the model in equation (70)
have been made. In the following, we will focus on a more specific setting
with multiple outputs, where dependencies are modeled by a conditional 
independence graph. This approach is motivated by the fact that indepen- 
dently predicting individual responses based on marginal response models 
will often be suboptimal and explicitly modeling these interactions can be 
of crucial importance. 

4.2.1. Markov networks and factorization theorem. Denote predictor
variables by $X$, response variables by $Y$ and define $Z := (X,Y)$ with associated
sample space $\mathcal{Z}$. We use Markov networks as the modeling formalism for
representing dependencies between covariates and response variables, as well as
interdependencies among response variables.

Definition 17. A conditional independence graph (or Markov network)
is an undirected graph $\mathcal{G} = (Z,E)$ such that, for any pair of variables, $(Z_i,Z_j) \notin
E$ if and only if $Z_i \perp\!\!\!\perp Z_j \mid Z - \{Z_i,Z_j\}$.

The above definition is based on the pairwise Markov property, but by
virtue of the separation theorem (see, e.g., Whittaker [154]) this implies
the global Markov property for distributions with full support. The global
Markov property says that for disjoint subsets $U, V, W \subseteq Z$ where $W$
separates $U$ from $V$ in $\mathcal{G}$ one has that $U \perp\!\!\!\perp V \mid W$. Even more important in
the context of this paper is the factorization result due to Hammersley and
Clifford [61].

Theorem 18. Given a random vector $Z$ with conditional independence
graph $\mathcal{G}$, any density function for $Z$ with full support factorizes over $\mathcal{C}(\mathcal{G})$,
the set of maximal cliques of $\mathcal{G}$, as follows:

$$p(z) = \exp\Big[ \sum_{c \in \mathcal{C}(\mathcal{G})} f_c(z_c) \Big], \tag{84}$$

where $f_c$ are clique compatibility functions dependent on $z$ only via the
restriction on clique configurations $z_c$.

The significance of this result is that in order to specify a distribution for 
Z, one only needs to specify or estimate the simpler functions f c . 

4.2.2. Kernel decomposition over Markov networks. It is of interest to
analyze the structure of kernels $k$ that generate Hilbert spaces $\mathcal{H}$ of functions
that are consistent with a graph.

Definition 19. A function $f : \mathcal{Z} \to \mathbb{R}$ is compatible with a conditional
independence graph $\mathcal{G}$, if $f$ decomposes additively as $f(z) = \sum_{c \in \mathcal{C}(\mathcal{G})} f_c(z_c)$
with suitably chosen functions $f_c$. A Hilbert space $\mathcal{H}$ over $\mathcal{Z}$ is compatible
with $\mathcal{G}$, if every function $f \in \mathcal{H}$ is compatible with $\mathcal{G}$. Such $f$ and $\mathcal{H}$ are also
called $\mathcal{G}$-compatible.

Proposition 20. Let $\mathcal{H}$ with kernel $k$ be a $\mathcal{G}$-compatible RKHS. Then
there are functions $k_{cd} : \mathcal{Z}_c \times \mathcal{Z}_d \to \mathbb{R}$ such that the kernel decomposes as

$$k(u,z) = \sum_{c,d \in \mathcal{C}} k_{cd}(u_c, z_d).$$

Lemma 21. Let $\mathcal{X}$ be a set of $n$-tuples and $f_i, g_i : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ for $i \in [n]$
functions such that $f_i(x,y) = f_i(x_i,y)$ and $g_i(x,y) = g_i(x,y_i)$. If $\sum_i f_i(x_i,y) =
\sum_j g_j(x,y_j)$ for all $x,y$, then there exist functions $h_{ij}$ such that $\sum_i f_i(x_i,y) =
\sum_{i,j} h_{ij}(x_i,y_j)$.

• Proposition 20 is useful for the design of kernels, since it states that only
kernels allowing an additive decomposition into local functions $k_{cd}$ are
compatible with a given Markov network $\mathcal{G}$. Lafferty et al. [89] have
pursued a similar approach by considering kernels for RKHS with functions
defined over $\mathcal{Z}_{\mathcal{C}} := \{(c,z_c) : c \in \mathcal{C}, z_c \in \mathcal{Z}_c\}$. In the latter case one can even
deal with cases where the conditional dependency graph is (potentially)
different for every instance.

• An illuminating example of how to design kernels via the
decomposition in Proposition 20 is the case of conditional Markov chains, for which
models based on joint kernels have been proposed in Altun et al. [6],
Collins [30], Lafferty et al. [90] and Taskar et al. [132]. Given an input
sequence $X = (X_t)_{t \in [T]}$, the goal is to predict a sequence of labels or
class variables $Y = (Y_t)_{t \in [T]}$, $Y_t \in \Sigma$. Dependencies between class variables
are modeled in terms of a Markov chain, whereas outputs $Y_t$ are assumed
to depend (directly) on an observation window $(X_{t-r}, \ldots, X_t, \ldots, X_{t+r})$.
Notice that this goes beyond the standard hidden Markov model
structure by allowing for overlapping features ($r \geq 1$). For simplicity, we
focus on a window size of $r = 1$, in which case the clique set is given by
$\mathcal{C} := \{c_t := (x_t,y_t,y_{t+1}),\ c'_t := (x_{t+1},y_t,y_{t+1}) : t \in [T-1]\}$. We assume an
input kernel $k$ is given and introduce indicator vectors (or dummy
variates) $I(Y_{\{t,t+1\}}) := (I_{\omega,\omega'}(Y_{\{t,t+1\}}))_{\omega,\omega' \in \Sigma}$. Now we can define the local
kernel functions as

$$k_{cd}(z_c, z'_d) := \langle I(y_{\{s,s+1\}}), I(y'_{\{t,t+1\}}) \rangle \cdot
\begin{cases}
k(x_s, x'_t) & \text{if } c = c_s \text{ and } d = c_t, \\
k(x_{s+1}, x'_{t+1}) & \text{if } c = c'_s \text{ and } d = c'_t.
\end{cases} \tag{85}$$

Notice that the inner product between indicator vectors is zero, unless the
variable pairs are in the same configuration.

Conditional Markov chain models have found widespread applications
in natural language processing (e.g., for part of speech tagging and shallow
parsing, cf. Sha and Pereira [122]), in information retrieval (e.g., for
information extraction, cf. McCallum et al. [96]) or in computational biology
(e.g., for gene prediction, cf. Culotta et al. [39]).

4.2.3. Clique-based sparse approximation. Proposition 20 immediately
leads to an alternative version of the representer theorem as observed by
Lafferty et al. [89] and Altun et al. [4].

Corollary 22. If $\mathcal{H}$ is $\mathcal{G}$-compatible then in the same setting as in
Corollary 13, the optimizer $\hat f$ can be written as

$$\hat f(u) = \sum_{i=1}^{n} \sum_{c \in \mathcal{C}} \sum_{y_c \in \mathcal{Y}_c} \beta^i_{c,y_c} \sum_{d \in \mathcal{C}} k_{cd}((x_{ic}, y_c), u_d), \tag{86}$$

where $x_{ic}$ are the variables of $x_i$ belonging to clique $c$ and $\mathcal{Y}_c$ is the subspace
of $\mathcal{Z}_c$ that contains response variables.

• Notice that the number of parameters in the representation equation (86) 
scales with n ■ Y^ c & l^c| as opposed to n ■ \y\ in equation (77). For cliques 
with reasonably small state spaces, this will be a significantly more com- 
pact representation. Notice also that the evaluation of functions k C( i will 
typically be more efficient than evaluating k. 

• In spite of this improvement, the number of terms in the expansion in 
equation (86) may in practice still be too large. In this case, one can pursue 
a reduced set approach, which selects a subset of variables to be included 
in a sparsified expansion. This has been proposed in Taskar et al. [132] for 
the soft margin maximization problem, as well as in Altun et al. [5] and 
Lafferty et al. [89] for conditional random fields and Gaussian processes. 
For instance, in Lafferty et al. [89] parameters $\beta^i_{c y_c}$ that maximize the
functional gradient of the regularized log-loss are greedily included in the
reduced set. In Taskar et al. [132] a similar selection criterion is utilized
with respect to margin violations, leading to an SMO-like optimization
algorithm (Platt [107]).
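
The local kernel of equation (85) and the parameter counts compared in the first remark can be made concrete with a small sketch. It assumes labels from a finite alphabet, 0-based positions, and that the two cases of (85) evaluate the input kernel at positions (s, t) and (s+1, t+1), respectively. The function names `local_kernel` and `parameter_counts` are hypothetical and chosen for illustration only.

```python
def local_kernel(k, x, y, s, x2, y2, t, primed=False):
    """Sketch of the local kernel in equation (85), using 0-based positions.

    primed=False compares cliques c_s and c_t   (inputs x[s],   x2[t]);
    primed=True  compares cliques c'_s and c'_t (inputs x[s+1], x2[t+1]).
    """
    # Inner product of indicator vectors: 1 if both label pairs agree, else 0.
    if (y[s], y[s + 1]) != (y2[t], y2[t + 1]):
        return 0.0
    shift = 1 if primed else 0
    return k(x[s + shift], x2[t + shift])


def parameter_counts(n, T, num_labels):
    """Compare n * sum_c |Y_c| with n * |Y| for a label chain of length T."""
    num_cliques = 2 * (T - 1)          # c_t and c'_t for each t in [T-1]
    per_clique = num_labels ** 2       # configurations of (y_t, y_{t+1})
    clique_based = n * num_cliques * per_clique
    full = n * num_labels ** T         # one parameter per complete labeling
    return clique_based, full
```

For n = 100 sequences of length T = 10 over 5 labels, the clique-based expansion has 45,000 coefficients, whereas an expansion over complete labelings would need 976,562,500.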

4.2.4. Probabilistic inference. In dealing with structured or interdepen- 
dent response variables, computing marginal probabilities of interest or com- 
puting the most probable response [cf. equation (79)] may be nontrivial. 
However, for dependency graphs with small tree width, efficient inference 
algorithms exist, such as the junction tree algorithm (Dawid [43] and Jensen 
et al. [76]) and variants thereof. Notice that in the case of the conditional 
or hidden Markov chain, the junction tree algorithm is equivalent to the 
well-known forward-backward algorithm (Baum [14]). Recently, a number 
of approximate inference algorithms have been developed to deal with depen- 
dency graphs for which exact inference is not tractable (see, e.g., Wainwright 
and Jordan [150]). 
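
For a chain, the most probable response [cf. equation (79)] can be found by the max-product counterpart of the forward-backward recursion, commonly called the Viterbi algorithm. The following is a minimal sketch, assuming that the score of a label sequence decomposes into per-position terms and transition terms. The names `viterbi`, `node_scores` and `edge_scores` are hypothetical.

```python
def viterbi(node_scores, edge_scores):
    """Return (best_score, best_path) for a chain-structured score.

    node_scores[t][a] : score of label a at position t
    edge_scores[a][b] : score of the transition a -> b
    """
    T = len(node_scores)
    S = len(node_scores[0])
    best = list(node_scores[0])   # best[a]: best score of a prefix ending in a
    back = []                     # back[t-1][b]: best predecessor of b at t
    for t in range(1, T):
        new_best, ptr = [], []
        for b in range(S):
            a_star = max(range(S), key=lambda a: best[a] + edge_scores[a][b])
            ptr.append(a_star)
            new_best.append(best[a_star] + edge_scores[a_star][b]
                            + node_scores[t][b])
        best = new_best
        back.append(ptr)
    # Trace the optimal path backwards from the best final label.
    last = max(range(S), key=lambda a: best[a])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return best[last], path
```

The cost is O(T S^2) for T positions and S labels, compared with S^T for exhaustive enumeration.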

5. Kernel methods for unsupervised learning. This section discusses var- 
ious methods of data analysis by modeling the distribution of data in feature 
space. To that end, we study the behavior of $\Phi(x)$ by means of rather
simple linear methods, which have implications for nonlinear methods on the
original data space $\mathcal{X}$. In particular, we will discuss the extension of PCA to
Hilbert spaces, which allows for image denoising, clustering, and nonlinear 
dimensionality reduction, the study of covariance operators for the measure 
of independence, the study of mean operators for the design of two-sample 
tests, and the modeling of complex dependencies between sets of random 
variables via kernel dependency estimation and canonical correlation anal- 
ysis. 

5.1. Kernel principal component analysis. Principal component analysis
(PCA) is a powerful technique for extracting structure from possibly
high-dimensional data sets. It is readily performed by solving an eigenvalue 
problem, or by using iterative algorithms which estimate principal compo- 
nents. 

PCA is an orthogonal transformation of the coordinate system in which 
we describe our data. The new coordinate system is obtained by projection 
onto the so-called principal axes of the data. A small number of principal 
components is often sufficient to account for most of the structure in the 
data. 

The basic idea is strikingly simple: denote by $X = \{x_1, \ldots, x_n\}$ an
$n$-sample drawn from $P(x)$. Then the covariance operator $C$ is given by
$C = \mathbf{E}[(x - \mathbf{E}[x])(x - \mathbf{E}[x])^\top]$. PCA aims at estimating leading eigenvectors of
$C$ via the empirical estimate $C_{emp} = \mathbf{E}_{emp}[(x - \mathbf{E}_{emp}[x])(x - \mathbf{E}_{emp}[x])^\top]$. If
$\mathcal{X}$ is $d$-dimensional, then the eigenvectors can be computed in $O(d^3)$ time
(Press et al. [110]).

The problem can also be posed in feature space (Schölkopf et al. [119])
by replacing $x$ with $\Phi(x)$. In this case, however, it is impossible to
compute the eigenvectors directly. Yet, note that the image of $C_{emp}$ lies in the
span of $\{\Phi(x_1), \ldots, \Phi(x_n)\}$. Hence, it is sufficient to diagonalize $C_{emp}$ in
that subspace. In other words, we replace the outer product $C_{emp}$ by an
inner product matrix, leaving the nonzero eigenvalues unchanged up to a
factor of $n$, which can be computed efficiently. Using
$w = \sum_{i=1}^{n} \alpha_i \Phi(x_i)$, it follows that $\alpha$ needs to satisfy
$PKP\alpha = \lambda\alpha$, where $P$ is the projection operator with $P_{ij} = \delta_{ij} - n^{-1}$ and
$K$ is the kernel matrix on $X$.
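
The eigenproblem above can be sketched in a few lines. The code below is an illustration under stated assumptions, not a reference implementation: it uses a Gaussian kernel by default, extracts only the leading component, and replaces a full eigendecomposition by plain power iteration. All function names are hypothetical.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def centered_kernel_matrix(X, kernel):
    """Return P K P with P_ij = delta_ij - 1/n."""
    n = len(X)
    K = [[kernel(a, b) for b in X] for a in X]
    row = [sum(r) / n for r in K]      # row means (equal to column means)
    tot = sum(row) / n                 # grand mean
    return [[K[i][j] - row[i] - row[j] + tot for j in range(n)]
            for i in range(n)]


def leading_eigenpair(M, iters=500):
    """Power iteration for the largest eigenvalue of a symmetric PSD matrix."""
    n = len(M)
    v = [math.sin(i + 1.0) for i in range(n)]   # deterministic start vector
    lam = 0.0
    for _ in range(iters):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = math.sqrt(sum(c * c for c in w))
        if lam == 0.0:
            break
        v = [c / lam for c in w]
    return lam, v


def kernel_pca_first_component(X, kernel):
    Kc = centered_kernel_matrix(X, kernel)
    lam, alpha = leading_eigenpair(Kc)
    if lam == 0.0:
        raise ValueError("centered kernel matrix has no nonzero eigenvalue")
    # Rescale so that w = sum_i alpha_i (Phi(x_i) - mean) has unit norm:
    # ||w||^2 = alpha^T Kc alpha = lam * ||alpha||^2.
    alpha = [a / math.sqrt(lam) for a in alpha]
    n = len(X)
    # Projections of the centered training points onto w.
    projections = [sum(Kc[i][j] * alpha[j] for j in range(n)) for i in range(n)]
    return lam, alpha, projections
```

With a linear kernel this reduces to ordinary PCA on the sample, which gives a simple sanity check: the squared projections sum to the leading eigenvalue of the centered Gram matrix.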

Note that the problem can also be recovered as one of maximizing some
$\mathrm{Contrast}[f, X]$ subject to $f \in \mathcal{F}$. This means that the projections onto the
leading eigenvectors correspond to the most reliable features. This
optimization problem also allows us to unify various feature extraction methods as
follows:

• For $\mathrm{Contrast}[f, X] = \mathrm{Var}_{emp}[f, X]$ and $\mathcal{F} = \{\langle w, x \rangle \text{ subject to } \|w\| \le 1\}$,
we recover PCA.

• Changing $\mathcal{F}$ to $\mathcal{F} = \{\langle w, \Phi(x) \rangle \text{ subject to } \|w\| \le 1\}$, we recover kernel
PCA.

• For $\mathrm{Contrast}[f, X] = \mathrm{Kurtosis}[f, X]$ and $\mathcal{F} = \{\langle w, x \rangle \text{ subject to } \|w\| \le 1\}$,
we have Projection Pursuit (Friedman and Tukey [55] and Huber [72]).
Other contrasts lead to further variants, such as the Epanechnikov kernel,
entropic contrasts, and so on (Cook et al. [32], Friedman [54] and Jones
and Sibson [79]).

• If $\mathcal{F}$ is a convex combination of basis functions and the contrast function
is convex in $w$, one obtains computationally efficient algorithms, as the
solution of the optimization problem can be found at one of the vertices
(Rockafellar [114] and Schölkopf and Smola [118]).

Subsequent projections are obtained, for example, by seeking directions
orthogonal to $f$ or other computationally attractive variants thereof.
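
The unifying view can be illustrated by a toy search in which only the contrast function changes. The sketch below assumes two-dimensional inputs and replaces a proper optimizer by a grid over unit vectors (for the variance contrast the maximum over $\|w\| \le 1$ lies on the boundary). The helper names are hypothetical.

```python
import math


def variance(v):
    m = sum(v) / len(v)
    return sum((a - m) ** 2 for a in v) / len(v)


def kurtosis(v):
    """Fourth standardized moment; returns 0.0 for a constant projection."""
    m = sum(v) / len(v)
    m2 = sum((a - m) ** 2 for a in v) / len(v)
    m4 = sum((a - m) ** 4 for a in v) / len(v)
    return m4 / (m2 * m2) if m2 > 0 else 0.0


def best_direction(X, contrast, steps=360):
    """Grid search over unit vectors w in the plane maximizing Contrast[<w, x>]."""
    best_val, best_w = -math.inf, None
    for s in range(steps):
        theta = math.pi * s / steps   # w and -w give the same contrast value
        w = (math.cos(theta), math.sin(theta))
        val = contrast([w[0] * a + w[1] * b for a, b in X])
        if val > best_val:
            best_val, best_w = val, w
    return best_val, best_w
```

Calling `best_direction(X, variance)` returns the first principal axis of the sample, while `best_direction(X, kurtosis)` gives a projection pursuit direction for the same data.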

Kernel PCA has been applied to numerous problems, from preprocessing
and invariant feature extraction (Mika et al. [100]) to image denoising
and super-resolution (Kim et al. [84]). The basic idea in the latter case is
to obtain a set of principal directions in feature space $w_1, \ldots, w_l$, obtained
from noise-free data, and to project the image $\Phi(x)$ of a noisy observation $x$
onto the space spanned by $w_1, \ldots, w_l$. This yields a "denoised" solution $\tilde\Phi(x)$
in feature space. Finally, to obtain the pre-image of this denoised solution,
one minimizes $\|\Phi(x') - \tilde\Phi(x)\|$. The fact that projections onto the leading
principal components turn out to be good starting points for pre-image
iterations is further exploited in kernel dependency estimation (Section 5.3).
Kernel PCA can be shown to contain several popular dimensionality
reduction algorithms as special cases, including LLE, Laplacian Eigenmaps and
(approximately) Isomap (Ham et al. [60]).

5.2. Canonical correlation and measures of independence. Given two
samples $X, Y$, canonical correlation analysis (Hotelling [70]) aims at finding
directions of projection $u, v$ such that the correlation coefficient between the
projections $\langle u, x \rangle$ and $\langle v, y \rangle$ is maximized. That is, $(u, v)$ are given by

(87)  $\operatorname*{argmax}_{u,v} \mathrm{Var}_{emp}[\langle u, x \rangle]^{-1/2}\, \mathrm{Var}_{emp}[\langle v, y \rangle]^{-1/2} \times \mathbf{E}_{emp}[\langle u, x - \mathbf{E}_{emp}[x] \rangle \langle v, y - \mathbf{E}_{emp}[y] \rangle].$

This problem can be solved by finding the eigensystem of $C_x^{-1/2} C_{xy} C_y^{-1/2}$,
where $C_x, C_y$ are the covariance matrices of $X$ and $Y$ and $C_{xy}$ is the
covariance matrix between $X$ and $Y$, respectively. Multivariate extensions are
discussed in Kettenring [83].
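
The reduction to an eigensystem follows from a change of variables. The short derivation below is a sketch that assumes $C_x$ and $C_y$ are invertible; the symbols $a$, $b$, $M$ and $\rho$ are introduced here for illustration.

```latex
\begin{align*}
a &:= C_x^{1/2} u, \qquad b := C_y^{1/2} v, \qquad
M := C_x^{-1/2} C_{xy} C_y^{-1/2}, \\
\rho(u, v) &= \frac{u^\top C_{xy} v}{\sqrt{u^\top C_x u}\,\sqrt{v^\top C_y v}}
            = \frac{a^\top M b}{\|a\|\,\|b\|}, \\
\max_{u, v} \rho(u, v) &= \sigma_{\max}(M), \qquad
\text{attained at } u = C_x^{-1/2} a_1,\ v = C_y^{-1/2} b_1 .
\end{align*}
```

Here $a_1$ and $b_1$ are the leading left and right singular vectors of $M$, that is, the leading eigenvectors of $M M^\top$ and $M^\top M$, and $\sigma_{\max}(M)$ is the first canonical correlation.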

CCA can be extended to kernels by means of replacing linear projections 
$\langle u, x \rangle$ by projections in feature space $\langle u, \Phi(x) \rangle$. More specifically, Bach and
Jordan [8] used the so-derived contrast to obtain a measure of independence 
and applied it to Independent Component Analysis with great success. How- 
ever, the formulation requires an additional regularization term to prevent 
the resulting optimization problem from becoming distribution independent. 

Rényi [113] showed that independence between random variables is
equivalent to the condition of vanishing covariance $\mathrm{Cov}[f(x), g(y)] = 0$ for all $C^1$
functions $f, g$ bounded by $L_\infty$ norm 1 on $\mathcal{X}$ and $\mathcal{Y}$. In Bach and Jordan
[8], Das and Sen [41], Dauxois and Nkiet [42] and Gretton et al. [58, 59] a 
constrained empirical estimate of the above criterion is used. That is, one 
studies 

(88)  $\Lambda(X, Y, \mathcal{F}, \mathcal{G}) := \sup_{f, g} \mathrm{Cov}_{emp}[f(x), g(y)]$ subject to $f \in \mathcal{F}$ and $g \in \mathcal{G}$.

This statistic is often extended to use the entire series $\Lambda_1, \ldots, \Lambda_d$ of
maximal correlations where each of the function pairs $(f_i, g_i)$ is orthogonal to
the previous set of terms. More specifically, Dauxois and Nkiet [42] restrict
$\mathcal{F}, \mathcal{G}$ to finite-dimensional linear function classes subject to their $L_2$ norm
bounded by 1, whereas Bach and Jordan [8] use functions in the RKHS for which
some sum of the $\ell_2^n$ and the RKHS norm on the sample is bounded.

Gretton et al. [58] use functions with bounded RKHS norm only, which
provides necessary and sufficient criteria if kernels are universal. That is,
$\Lambda(X, Y, \mathcal{F}, \mathcal{G}) = 0$ if and only if $x$ and $y$ are independent. Moreover,
$\operatorname{tr} P K_x P K_y P$ has the same theoretical properties and it can be computed
much more easily in linear time, as it allows for incomplete Cholesky
factorizations. Here $K_x$ and $K_y$ are the kernel matrices on $X$ and $Y$ respectively.
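
The trace criterion is straightforward to evaluate naively. The sketch below assumes scalar samples and Gaussian kernels, and it uses the full kernel matrices, so it runs in quadratic rather than linear time; the incomplete Cholesky speedup mentioned above is not implemented. The function names are hypothetical.

```python
import math


def gram(samples, sigma=1.0):
    """Gaussian kernel matrix on a list of scalars."""
    return [[math.exp(-(a - b) ** 2 / (2.0 * sigma ** 2)) for b in samples]
            for a in samples]


def center(K):
    """Return P K P with P_ij = delta_ij - 1/n."""
    n = len(K)
    row = [sum(r) / n for r in K]
    tot = sum(row) / n
    return [[K[i][j] - row[i] - row[j] + tot for j in range(n)]
            for i in range(n)]


def trace_statistic(xs, ys, sigma=1.0):
    """tr(P Kx P Ky P), computed as tr((P Kx P)(P Ky P)) since P P = P."""
    Kx = center(gram(xs, sigma))
    Ky = center(gram(ys, sigma))
    n = len(xs)
    return sum(Kx[i][j] * Ky[j][i] for i in range(n) for j in range(n))
```

The statistic vanishes when one of the samples is constant and is strictly positive when the two samples coincide; in practice its value would be compared against a threshold or a permutation null distribution.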

The above criteria can be used to derive algorithms for Independent 
Component Analysis (Bach and Jordan [8] and Gretton et al. [58]). While 
these algorithms come at a considerable computational cost, they offer very 
good performance. For faster algorithms, consider the work of Cardoso [26], 
Hyvärinen [73] and Lee et al. [91]. Also, the work of Chen and Bickel [28]
and Yang and Amari [155] is of interest in this context. 

Note that a similar approach can be used to develop two-sample tests
based on kernel methods. The basic idea is that for universal kernels the
map between distributions and points on the marginal polytope
$\mu : p \to \mathbf{E}_{x \sim p}[\phi(x)]$ is bijective and, consequently, it induces a norm on
distributions. This builds on the ideas of [52]. The corresponding distance
$d(p, q) := \|\mu[p] - \mu[q]\|$ leads to a $U$-statistic which allows one to compute empirical
estimates of distances between distributions efficiently [22].
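
Expanding $\|\mu[p] - \mu[q]\|^2$ into kernel expectations gives a simple empirical estimate. The sketch below uses the standard unbiased estimator that omits the diagonal terms, with a Gaussian kernel on scalars as an assumed default; it is an illustration and not necessarily the exact statistic used in [22]. The names are hypothetical.

```python
import math


def gaussian(a, b, sigma=1.0):
    return math.exp(-(a - b) ** 2 / (2.0 * sigma ** 2))


def mmd2_unbiased(xs, ys, kernel=gaussian):
    """Unbiased estimate of ||mu[p] - mu[q]||^2 from samples xs ~ p, ys ~ q."""
    m, n = len(xs), len(ys)
    # E k(x, x') and E k(y, y'): average over distinct pairs only.
    kxx = sum(kernel(xs[i], xs[j])
              for i in range(m) for j in range(m) if i != j)
    kyy = sum(kernel(ys[i], ys[j])
              for i in range(n) for j in range(n) if i != j)
    # E k(x, y): average over all cross pairs.
    kxy = sum(kernel(a, b) for a in xs for b in ys)
    return kxx / (m * (m - 1)) + kyy / (n * (n - 1)) - 2.0 * kxy / (m * n)
```

Because the estimator is unbiased, it can take small negative values when the two samples come from nearly identical distributions.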

5.3. Kernel dependency estimation. A large part of the previous
discussion revolved around estimating dependencies between samples $\mathcal{X}$ and $\mathcal{Y}$ for
rather structured spaces $\mathcal{Y}$, in particular, (64). In general, however, such
dependencies can be hard to compute. Weston et al. [153] proposed an
algorithm which allows one to extend standard regularized LS regression models,
as described in Section 3.3, to cases where $\mathcal{Y}$ has complex structure.

It works by recasting the estimation problem as a linear estimation
problem for the map $f : x \to \Phi(y)$ and then as a nonlinear pre-image
estimation problem for finding $\hat y := \operatorname{argmin}_y \|f(x) - \Phi(y)\|$ as the point in $\mathcal{Y}$ closest
to $f(x)$.

This problem can be solved directly (Cortes et al. [33]) without the need 
for subspace projections. The authors apply it to the analysis of sequence 
data. 
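
The two steps can be sketched as follows. The regression step is kernel ridge regression into the output feature space, so that $f(x) = \sum_i a_i(x) \Phi(y_i)$. The pre-image step is a brute-force search over a finite candidate set, which is an assumption of this sketch and not the procedure of either cited paper. The names `solve` and `kde_predict` are hypothetical.

```python
import math


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z


def kde_predict(x, X, Y, candidates, k_in, k_out, reg=1e-3):
    """Predict a structured output for x from training pairs (X[i], Y[i])."""
    n = len(X)
    # Regression step: a(x) = (K_in + reg * I)^{-1} k_in(X, x).
    A = [[k_in(X[i], X[j]) + (reg if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    a = solve(A, [k_in(xi, x) for xi in X])

    # Pre-image step: ||f(x) - Phi(y)||^2 up to the constant ||f(x)||^2.
    def dist2(y):
        return k_out(y, y) - 2.0 * sum(a[i] * k_out(Y[i], y) for i in range(n))

    return min(candidates, key=dist2)
```

Only kernel evaluations on the outputs are needed, so `k_out` may be any positive definite kernel on strings, trees or other structured objects.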

6. Conclusion. We have summarized some of the advances in the field
of machine learning with positive definite kernels. Due to lack of space,
this article is by no means comprehensive; in particular, we were not able to
cover statistical learning theory, which is often cited as providing theoretical
support for kernel methods. Nevertheless, we hope that the main
ideas that make kernel methods attractive have become clear. In particular, these
include the fact that kernels address the following three major issues of
learning and inference:

• They formalize the notion of similarity of data. 

• They provide a representation of the data in an associated reproducing 
kernel Hilbert space. 

• They characterize the function class used for estimation via the represen- 
ter theorem [see equations (38) and (86)]. 

We have explained a number of approaches where kernels are useful. Many 
of them involve the substitution of kernels for dot products, thus turning 
a linear geometric algorithm into a nonlinear one. This way, one obtains 
SVMs from hyperplane classifiers, and kernel PCA from linear PCA. There 
is, however, a more recent method of constructing kernel algorithms, where 
the starting point is not a linear algorithm, but a linear criterion [e.g., that 
two random variables have zero covariance, or that the means of two samples 
are identical], which can be turned into a condition involving an efficient 
optimization over a large function class using kernels, thus yielding tests 
for independence of random variables, or tests for solving the two-sample 
problem. We believe that these works, as well as the increasing amount of 
work on the use of kernel methods for structured data, illustrate that we 
can expect significant further progress in the years to come. 